mbox series

[net-next,v2,0/5] tls: implement key updates for TLS1.3

Message ID cover.1676052788.git.sd@queasysnail.net (mailing list archive)
Headers show
Series tls: implement key updates for TLS1.3 | expand

Message

Sabrina Dubroca Feb. 14, 2023, 11:17 a.m. UTC
This adds support for receiving KeyUpdate messages (RFC 8446, 4.6.3
[1]). A sender transmits a KeyUpdate message and then changes its TX
key. The receiver should react by updating its RX key before
processing the next message.

This patchset implements key updates by:
 1. pausing decryption when a KeyUpdate message is received, to avoid
    attempting to use the old key to decrypt a record encrypted with
    the new key
 2. returning -EKEYEXPIRED to syscalls that cannot receive the
    KeyUpdate message, until the rekey has been performed by userspace
 3. passing the KeyUpdate message to userspace as a control message
 4. allowing updates of the crypto_info via the TLS_TX/TLS_RX
    setsockopts

This API has been tested with gnutls to make sure that it allows
userspace libraries to implement key updates [2]. Thanks to Frantisek
Krenzelok <fkrenzel@redhat.com> for providing the implementation in
gnutls and testing the kernel patches.

Note: in a future series, I'll clean up tls_set_sw_offload and
eliminate the per-cipher copy-paste using tls_cipher_size_desc.

[1] https://www.rfc-editor.org/rfc/rfc8446#section-4.6.3
[2] https://gitlab.com/gnutls/gnutls/-/merge_requests/1625


Changes in v2
use reverse xmas tree ordering in tls_set_sw_offload and
do_tls_setsockopt_conf
turn the alt_crypto_info into an else if
selftests: add rekey_fail test

Vadim suggested simplifying tls_set_sw_offload by copying the new
crypto_info in the context in do_tls_setsockopt_conf, and then
detecting the rekey in tls_set_sw_offload based on whether the iv was
already set, but I don't think we can have a common error path
(otherwise we'd free the aead etc on rekey failure). I decided instead
to reorganize tls_set_sw_offload so that the context is unmodified
until we know the rekey cannot fail. Some fields will be touched
during the rekey, but they will be set to the same value they had
before the rekey (prot->rec_seq_size, etc).

Apoorv suggested to name the struct tls_crypto_info_keys "tls13"
rather than "tls12". Since we're using the same crypto_info data for
TLS1.3 as for 1.2, even if the tests only run for TLS1.3, I'd rather
keep the "tls12" name, in case we end up adding a
"tls13_crypto_info_aes_gcm_128" type in the future.

Kuniyuki and Apoorv also suggested preventing rekeys on RX when we
haven't received a matching KeyUpdate message, but I'd rather let
userspace handle this and have a symmetric API between TX and RX on
the kernel side. It's a bit of a foot-gun, but we can't really stop a
broken userspace from rolling back the rec_seq on an existing
crypto_info either, and that seems like a worse possible breakage.

Sabrina Dubroca (5):
  tls: remove tls_context argument from tls_set_sw_offload
  tls: block decryption when a rekey is pending
  tls: implement rekey for TLS1.3
  selftests: tls: add key_generation argument to tls_crypto_info_init
  selftests: tls: add rekey tests

 include/net/tls.h                 |   4 +
 net/tls/tls.h                     |   3 +-
 net/tls/tls_device.c              |   2 +-
 net/tls/tls_main.c                |  37 +++-
 net/tls/tls_sw.c                  | 189 +++++++++++++----
 tools/testing/selftests/net/tls.c | 336 +++++++++++++++++++++++++++++-
 6 files changed, 511 insertions(+), 60 deletions(-)

Comments

Jakub Kicinski Feb. 15, 2023, 5:08 a.m. UTC | #1
On Tue, 14 Feb 2023 12:17:37 +0100 Sabrina Dubroca wrote:
> Changes in v2
> use reverse xmas tree ordering in tls_set_sw_offload and
> do_tls_setsockopt_conf
> turn the alt_crypto_info into an else if
> selftests: add rekey_fail test
> 
> Vadim suggested simplifying tls_set_sw_offload by copying the new
> crypto_info in the context in do_tls_setsockopt_conf, and then
> detecting the rekey in tls_set_sw_offload based on whether the iv was
> already set, but I don't think we can have a common error path
> (otherwise we'd free the aead etc on rekey failure). I decided instead
> to reorganize tls_set_sw_offload so that the context is unmodified
> until we know the rekey cannot fail. Some fields will be touched
> during the rekey, but they will be set to the same value they had
> before the rekey (prot->rec_seq_size, etc).
> 
> Apoorv suggested to name the struct tls_crypto_info_keys "tls13"
> rather than "tls12". Since we're using the same crypto_info data for
> TLS1.3 as for 1.2, even if the tests only run for TLS1.3, I'd rather
> keep the "tls12" name, in case we end up adding a
> "tls13_crypto_info_aes_gcm_128" type in the future.
> 
> Kuniyuki and Apoorv also suggested preventing rekeys on RX when we
> haven't received a matching KeyUpdate message, but I'd rather let
> userspace handle this and have a symmetric API between TX and RX on
> the kernel side. It's a bit of a foot-gun, but we can't really stop a
> broken userspace from rolling back the rec_seq on an existing
> crypto_info either, and that seems like a worse possible breakage.

And how will we handle re-keying in offload?

>  include/net/tls.h                 |   4 +
>  net/tls/tls.h                     |   3 +-
>  net/tls/tls_device.c              |   2 +-
>  net/tls/tls_main.c                |  37 +++-
>  net/tls/tls_sw.c                  | 189 +++++++++++++----
>  tools/testing/selftests/net/tls.c | 336 +++++++++++++++++++++++++++++-

Documentation please.
Sabrina Dubroca Feb. 15, 2023, 5:29 p.m. UTC | #2
2023-02-14, 21:08:11 -0800, Jakub Kicinski wrote:
> On Tue, 14 Feb 2023 12:17:37 +0100 Sabrina Dubroca wrote:
> > Changes in v2
> > use reverse xmas tree ordering in tls_set_sw_offload and
> > do_tls_setsockopt_conf
> > turn the alt_crypto_info into an else if
> > selftests: add rekey_fail test
> > 
> > Vadim suggested simplifying tls_set_sw_offload by copying the new
> > crypto_info in the context in do_tls_setsockopt_conf, and then
> > detecting the rekey in tls_set_sw_offload based on whether the iv was
> > already set, but I don't think we can have a common error path
> > (otherwise we'd free the aead etc on rekey failure). I decided instead
> > to reorganize tls_set_sw_offload so that the context is unmodified
> > until we know the rekey cannot fail. Some fields will be touched
> > during the rekey, but they will be set to the same value they had
> > before the rekey (prot->rec_seq_size, etc).
> > 
> > Apoorv suggested to name the struct tls_crypto_info_keys "tls13"
> > rather than "tls12". Since we're using the same crypto_info data for
> > TLS1.3 as for 1.2, even if the tests only run for TLS1.3, I'd rather
> > keep the "tls12" name, in case we end up adding a
> > "tls13_crypto_info_aes_gcm_128" type in the future.
> > 
> > Kuniyuki and Apoorv also suggested preventing rekeys on RX when we
> > haven't received a matching KeyUpdate message, but I'd rather let
> > userspace handle this and have a symmetric API between TX and RX on
> > the kernel side. It's a bit of a foot-gun, but we can't really stop a
> > broken userspace from rolling back the rec_seq on an existing
> > crypto_info either, and that seems like a worse possible breakage.
> 
> And how will we handle re-keying in offload?

Sorry for the stupid question... do you mean that I need to solve that
problem before this series can progress, or that the cover letter
should summarize the state of the discussion?

> >  include/net/tls.h                 |   4 +
> >  net/tls/tls.h                     |   3 +-
> >  net/tls/tls_device.c              |   2 +-
> >  net/tls/tls_main.c                |  37 +++-
> >  net/tls/tls_sw.c                  | 189 +++++++++++++----
> >  tools/testing/selftests/net/tls.c | 336 +++++++++++++++++++++++++++++-
> 
> Documentation please.

Ok, I'll add a paragraph to Documentation/networking/tls.rst in the
next version.
Jakub Kicinski Feb. 15, 2023, 7:10 p.m. UTC | #3
On Wed, 15 Feb 2023 18:29:50 +0100 Sabrina Dubroca wrote:
> > And how will we handle re-keying in offload?  
> 
> Sorry for the stupid question... do you mean that I need to solve that
> problem before this series can progress, or that the cover letter
> should summarize the state of the discussion?

I maintain that 1.3 offload is much more important than rekeying.
Offloads being available for 1.2 may be stalling adoption of 1.3
(just a guess, I run across this article mentioning 1.2 being used
in Oracle cloud for instance:
https://blogs.oracle.com/cloudsecurity/post/how-oci-helps-you-protect-data-with-default-encryption
could be because MITM requirements, or maybe they have HW which
can only do 1.2? Dunno).

But I'm willing to compromise, we just need a solid plan of how to
handle the inevitable. I'm worried that how this will pay out is:
 - you don't care about offload and add rekey
 - vendors don't care about rekey and add 1.3
  ... time passes ...
 - both you and the vendors have moved on
 - users run into issues, waste their time debugging and
   eventually report the problem upstream
 - it's on me to fix?

:(
Sabrina Dubroca Feb. 15, 2023, 11:23 p.m. UTC | #4
2023-02-15, 11:10:20 -0800, Jakub Kicinski wrote:
> On Wed, 15 Feb 2023 18:29:50 +0100 Sabrina Dubroca wrote:
> > > And how will we handle re-keying in offload?  
> > 
> > Sorry for the stupid question... do you mean that I need to solve that
> > problem before this series can progress, or that the cover letter
> > should summarize the state of the discussion?
> 
> I maintain that 1.3 offload is much more important than rekeying.

Sure, it'd be great if we had 1.3 offload (and it should also support
rekeying, because you can't force the peer to not rekey). But I can't
implement offload.

> Offloads being available for 1.2 may be stalling adoption of 1.3
> (just a guess, I run across this article mentioning 1.2 being used
> in Oracle cloud for instance:
> https://blogs.oracle.com/cloudsecurity/post/how-oci-helps-you-protect-data-with-default-encryption
> could be because MITM requirements, or maybe they have HW which
> can only do 1.2? Dunno).
> 
> But I'm willing to compromise, we just need a solid plan of how to
> handle the inevitable. I'm worried that how this will pay out is:
>  - you don't care about offload and add rekey

I think that's a bit unfair. Not having to deal with offload at all
would make things easier for me, sure, but I'm open to the discussion,
even if I don't have a good understanding of the offloading side.

I'd just like to avoid holding this feature (arguably a bug fix) until
the vendors finally decide that they care about 1.3, if possible. If
not, so be it.

I wasn't trying to force you to accept this series. Sorry if that's
what it sounded like. I really wanted to understand what you were
asking for, because your question wasn't clear to me. Now it makes
sense.

>  - vendors don't care about rekey and add 1.3
>   ... time passes ...
>  - both you and the vendors have moved on
>  - users run into issues, waste their time debugging and
>    eventually report the problem upstream
>  - it's on me to fix?
> 
> :(

Yeah, I see. If the rekey already exists in SW, I think it'll be a bit
harder for them to just not care about it, but maybe I'm being
optimistic.

I'm not sure we can come up with the correct uAPI/rekey design without
trying to implement rekey with offload and seeing how that blows up
(and possibly in different ways with different devices).

Picking up from where the discussion died off in the previous thread:

On transmit, I think the software fallback for retransmits will be
needed, whether we can keep two generations of keys on the device or
just one. We could have 2 consecutive rekeys, without even worrying
about a broken peer spamming key updates for both sides (or the local
user's library doing that). If devices can juggle 3 generations of
keys, then maybe we don't have to worry too much about software
fallback, but we'll need to define an API to set the extra keys ahead
of time and then advance to the next one. Will all devices support
installing 2 or 3 keys?

On receive, we also have the problem of more than one rekey arriving,
so if we can't feed enough keys to the device in advance, we'll have
to decrypt some records in software. The host will have to survive the
burst of software decryption while we wait until the device config
catches up.

One option might be to do the key derivation in the kernel following
section 7.2 of the RFC [1]. I don't know how happy crypto/security
people would be with that. We'd have to introduce new crypto_info
structs, and new cipher types (or a flag in the upper bits of the
cipher type) to go with them. Then the kernel processes incoming key
update messages on its own, and emits its own key update messages when
its current key is expiring. On transmit we also need to inject a
Finished message before the KeyUpdate [2]. That's bringing a lot of
TLS logic in the kernel. At that point we might as well do the whole
handshake... but I really hope it doesn't come to that.

It's hard, but I don't think "let's just kill the connection" is a
nice solution. It's definitely easier for the kernel.

[1] https://www.rfc-editor.org/rfc/rfc8446#section-7.2
[2] https://www.rfc-editor.org/rfc/rfc8446#section-4.6.3
Jakub Kicinski Feb. 16, 2023, 3:57 a.m. UTC | #5
On Thu, 16 Feb 2023 00:23:11 +0100 Sabrina Dubroca wrote:
> > Offloads being available for 1.2 may be stalling adoption of 1.3
> > (just a guess, I run across this article mentioning 1.2 being used
> > in Oracle cloud for instance:
> > https://blogs.oracle.com/cloudsecurity/post/how-oci-helps-you-protect-data-with-default-encryption
> > could be because MITM requirements, or maybe they have HW which
> > can only do 1.2? Dunno).
> > 
> > But I'm willing to compromise, we just need a solid plan of how to
> > handle the inevitable. I'm worried that how this will pay out is:
> >  - you don't care about offload and add rekey  
> 
> I think that's a bit unfair. Not having to deal with offload at all
> would make things easier for me, sure, but I'm open to the discussion,
> even if I don't have a good understanding of the offloading side.
> 
> I'd just like to avoid holding this feature (arguably a bug fix) until
> the vendors finally decide that they care about 1.3, if possible. If
> not, so be it.
> 
> I wasn't trying to force you to accept this series. Sorry if that's
> what it sounded like. I really wanted to understand what you were
> asking for, because your question wasn't clear to me. Now it makes
> sense.
> 
> >  - vendors don't care about rekey and add 1.3
> >   ... time passes ...
> >  - both you and the vendors have moved on
> >  - users run into issues, waste their time debugging and
> >    eventually report the problem upstream
> >  - it's on me to fix?
> > 
> > :(  
> 
> Yeah, I see. If the rekey already exists in SW, I think it'll be a bit
> harder for them to just not care about it, but maybe I'm being
> optimistic.

True, they may try to weasel out / require some pushing and support.
Depends on which vendor gets to it first, I guess.

> I'm not sure we can come up with the correct uAPI/rekey design without
> trying to implement rekey with offload and seeing how that blows up
> (and possibly in different ways with different devices).

Yes, best we can do now is have a plan in place... and your promise 
of future help? :) (incl. being on the lookout for when the patches 
come because I'll probably forget)

> Picking up from where the discussion died off in the previous thread:
> 
> On transmit, I think the software fallback for retransmits will be
> needed, whether we can keep two generations of keys on the device or
> just one. We could have 2 consecutive rekeys, without even worrying
> about a broken peer spamming key updates for both sides (or the local
> user's library doing that). If devices can juggle 3 generations of
> keys, then maybe we don't have to worry too much about software
> fallback, but we'll need to define an API to set the extra keys ahead
> of time and then advance to the next one. Will all devices support
> installing 2 or 3 keys?

I think we could try to switch to SW crypto on Tx until all data using
old key is ACK'ed, drivers can look at skb->decrypted to skip touching
the transitional skbs. Then remove old key, install new one, resume
offload.

We may need special care to make sure we don't try to encrypt the same
packet with both keys. In case a rtx gets stuck somewhere and comes to
the NIC after it's already acked (happens surprisingly often).

Multiple keys on the device would probably mean the device needs some
intelligence to know when to use which - not my first choice.

> On receive, we also have the problem of more than one rekey arriving,
> so if we can't feed enough keys to the device in advance, we'll have
> to decrypt some records in software. The host will have to survive the
> burst of software decryption while we wait until the device config
> catches up.

I think receive is easier. The fallback is quite effective and already
in place. Here too we may want to enforce some transitional SW-only
mode to avoid the (highly unlikely?) case that NIC will decrypt
successfully a packet with the old key, even tho new key should be used.
Carrying "key ID" with the skb is probably an overkill.

> One option might be to do the key derivation in the kernel following
> section 7.2 of the RFC [1]. I don't know how happy crypto/security
> people would be with that. We'd have to introduce new crypto_info
> structs, and new cipher types (or a flag in the upper bits of the
> cipher type) to go with them. Then the kernel processes incoming key
> update messages on its own, and emits its own key update messages when
> its current key is expiring. On transmit we also need to inject a
> Finished message before the KeyUpdate [2]. That's bringing a lot of
> TLS logic in the kernel. At that point we might as well do the whole
> handshake... but I really hope it doesn't come to that.

I think it's mostly a device vs host state sharing problem, so TLS ULP
or user space - not a big difference, both are on the host.
Sabrina Dubroca Feb. 16, 2023, 4:23 p.m. UTC | #6
2023-02-15, 19:57:48 -0800, Jakub Kicinski wrote:
> On Thu, 16 Feb 2023 00:23:11 +0100 Sabrina Dubroca wrote:
> > > Offloads being available for 1.2 may be stalling adoption of 1.3
> > > (just a guess, I run across this article mentioning 1.2 being used
> > > in Oracle cloud for instance:
> > > https://blogs.oracle.com/cloudsecurity/post/how-oci-helps-you-protect-data-with-default-encryption
> > > could be because MITM requirements, or maybe they have HW which
> > > can only do 1.2? Dunno).
> > > 
> > > But I'm willing to compromise, we just need a solid plan of how to
> > > handle the inevitable. I'm worried that how this will pay out is:
> > >  - you don't care about offload and add rekey  
> > 
> > I think that's a bit unfair. Not having to deal with offload at all
> > would make things easier for me, sure, but I'm open to the discussion,
> > even if I don't have a good understanding of the offloading side.
> > 
> > I'd just like to avoid holding this feature (arguably a bug fix) until
> > the vendors finally decide that they care about 1.3, if possible. If
> > not, so be it.
> > 
> > I wasn't trying to force you to accept this series. Sorry if that's
> > what it sounded like. I really wanted to understand what you were
> > asking for, because your question wasn't clear to me. Now it makes
> > sense.
> > 
> > >  - vendors don't care about rekey and add 1.3
> > >   ... time passes ...
> > >  - both you and the vendors have moved on
> > >  - users run into issues, waste their time debugging and
> > >    eventually report the problem upstream
> > >  - it's on me to fix?
> > > 
> > > :(  
> > 
> > Yeah, I see. If the rekey already exists in SW, I think it'll be a bit
> > harder for them to just not care about it, but maybe I'm being
> > optimistic.
> 
> True, they may try to weasel out / require some pushing and support.
> Depends on which vendor gets to it first, I guess.
> 
> > I'm not sure we can come up with the correct uAPI/rekey design without
> > trying to implement rekey with offload and seeing how that blows up
> > (and possibly in different ways with different devices).
> 
> Yes, best we can do now is have a plan in place... and your promise 
> of future help? :) (incl. being on the lookout for when the patches 
> come because I'll probably forget)

I'll try to help. None of us can guarantee that we'll be around :)

> > Picking up from where the discussion died off in the previous thread:
> > 
> > On transmit, I think the software fallback for retransmits will be
> > needed, whether we can keep two generations of keys on the device or
> > just one. We could have 2 consecutive rekeys, without even worrying
> > about a broken peer spamming key updates for both sides (or the local
> > user's library doing that). If devices can juggle 3 generations of
> > keys, then maybe we don't have to worry too much about software
> > fallback, but we'll need to define an API to set the extra keys ahead
> > of time and then advance to the next one. Will all devices support
> > installing 2 or 3 keys?
> 
> I think we could try to switch to SW crypto on Tx until all data using
> old key is ACK'ed, drivers can look at skb->decrypted to skip touching
> the transitional skbs. Then remove old key, install new one, resume
> offload.

"all data using the old key" needs to be one list of record per old
key, since we can have multiple rekeys.

Could we install the new key in HW a bit earlier? Keep the old key as
SW fallback for rtx, but the driver installs the new key when the
corresponding KeyUpdate record has gone through and tells the stack to
stop doing SW crypto? I'm not sure that'd be a significant improvement
in the standard case, though.

> We may need special care to make sure we don't try to encrypt the same
> packet with both keys. In case a rtx gets stuck somewhere and comes to
> the NIC after it's already acked (happens surprisingly often).

Don't we have that already? If there's a retransmit while we're
setting the TX key in HW, data that was queued on the socket before
(and shouldn't be encrypted at all) would also be encrypted
otherwise. Or is it different with rekey?

> Multiple keys on the device would probably mean the device needs some
> intelligence to know when to use which - not my first choice.

I only mentioned that because Boris talked about it in the previous
thread.  It's also not my first choice because I doubt all devices
will support it the same way.

> > On receive, we also have the problem of more than one rekey arriving,
> > so if we can't feed enough keys to the device in advance, we'll have
> > to decrypt some records in software. The host will have to survive the
> > burst of software decryption while we wait until the device config
> > catches up.
> 
> I think receive is easier. The fallback is quite effective and already
> in place.

I was thinking about a peer sending at line rate just after the rekey,
and we have to handle that in software for a short while. If the peer
is also dropping to a software fallback, I guess there's no issue,
both sides will slow down.

> Here too we may want to enforce some transitional SW-only
> mode to avoid the (highly unlikely?) case that NIC will decrypt
> successfully a packet with the old key, even tho new key should be used.

Not a crypto expert, but I don't think that's a realistic thing to
worry about.

If the NIC uses the old key to decrypt a record that was encrypted
with the new key and fails, we end up seeing the encrypted record and
decrypting it in SW, right? (and then we can pick the correct key in
the SW fallback) We don't just tear down the connection?

> Carrying "key ID" with the skb is probably an overkill.

I think we can retrieve it from the sequence of records. When we see a
KeyUpdate, we advance the "key ID" after we're done with the message.


This makes me wonder again if we should have fake offloads on veth
(still calling the kernel's crypto library to simulate a device doing
the encryption and/or decryption), to make it easy to play with the
software bits, without requiring actual hardware that can offload
TLS/IPsec/MACsec. But maybe it's too complex to be useful and we'd
just waste our time fixing bugs in the fake offload rather than
improving the stack.

> > One option might be to do the key derivation in the kernel following
> > section 7.2 of the RFC [1]. I don't know how happy crypto/security
> > people would be with that. We'd have to introduce new crypto_info
> > structs, and new cipher types (or a flag in the upper bits of the
> > cipher type) to go with them. Then the kernel processes incoming key
> > update messages on its own, and emits its own key update messages when
> > its current key is expiring. On transmit we also need to inject a
> > Finished message before the KeyUpdate [2]. That's bringing a lot of
> > TLS logic in the kernel. At that point we might as well do the whole
> > handshake... but I really hope it doesn't come to that.
> 
> I think it's mostly a device vs host state sharing problem, so TLS ULP
> or user space - not a big difference, both are on the host.

Maybe. But that's one more implementation of TLS (or large parts of
it) that admins and security teams have to worry about. Maintained by
networking people. I really want to avoid going down that route.


I'll try to dig more into what you're proposing and to poke some holes
into it over the next few days.
Jakub Kicinski Feb. 22, 2023, 3:19 a.m. UTC | #7
Sorry for the delay, long weekend + merge window.

On Thu, 16 Feb 2023 17:23:19 +0100 Sabrina Dubroca wrote:
> 2023-02-15, 19:57:48 -0800, Jakub Kicinski wrote:
> > On Thu, 16 Feb 2023 00:23:11 +0100 Sabrina Dubroca wrote:  
> > > I'm not sure we can come up with the correct uAPI/rekey design without
> > > trying to implement rekey with offload and seeing how that blows up
> > > (and possibly in different ways with different devices).  
> > 
> > Yes, best we can do now is have a plan in place... and your promise 
> > of future help? :) (incl. being on the lookout for when the patches 
> > come because I'll probably forget)  
> 
> I'll try to help. None of us can guarantee that we'll be around :)

Right.

> > I think we could try to switch to SW crypto on Tx until all data using
> > old key is ACK'ed, drivers can look at skb->decrypted to skip touching
> > the transitional skbs. Then remove old key, install new one, resume
> > offload.  
> 
> "all data using the old key" needs to be one list of record per old
> key, since we can have multiple rekeys.

No fully parsing this bit.

> Could we install the new key in HW a bit earlier? Keep the old key as
> SW fallback for rtx, but the driver installs the new key when the
> corresponding KeyUpdate record has gone through and tells the stack to
> stop doing SW crypto? I'm not sure that'd be a significant improvement
> in the standard case, though.

Important consideration is making the non-rekey path as fast as
possible (given rekeying is extremely rare). Looking at skb->decrypted
should be very fast but we can possibly fit some other indication of
"are we rekeying" into another already referenced cache line.
We definitely don't want to have to look up the record to know what
state we're in.

The fallback can't use AES-NI (it's in sirq context) so it's slower 
than SW encrypt before queuing to TCP. Hence my first thought is using
SW crypto for new key and let the traffic we already queued with old
key drain leveraging HW crypto. But as I said the impact on performance
when not rekeying is more important, and so is driver simplicity.

> > We may need special care to make sure we don't try to encrypt the same
> > packet with both keys. In case a rtx gets stuck somewhere and comes to
> > the NIC after it's already acked (happens surprisingly often).  
> 
> Don't we have that already? If there's a retransmit while we're
> setting the TX key in HW, data that was queued on the socket before
> (and shouldn't be encrypted at all) would also be encrypted
> otherwise. Or is it different with rekey?

We have a "start marker" record which is supposed to indicate that
anything before it has already been encrypted. The driver is programmed
with the start seq no, when it sees a packet from before this seq no
it checks if a record exists, finds its before the start marker and
sends the data as is.

> > Multiple keys on the device would probably mean the device needs some
> > intelligence to know when to use which - not my first choice.  
> 
> I only mentioned that because Boris talked about it in the previous
> thread.  It's also not my first choice because I doubt all devices
> will support it the same way.
> 
> > > On receive, we also have the problem of more than one rekey arriving,
> > > so if we can't feed enough keys to the device in advance, we'll have
> > > to decrypt some records in software. The host will have to survive the
> > > burst of software decryption while we wait until the device config
> > > catches up.  
> > 
> > I think receive is easier. The fallback is quite effective and already
> > in place.  
> 
> I was thinking about a peer sending at line rate just after the rekey,
> and we have to handle that in software for a short while. If the peer
> is also dropping to a software fallback, I guess there's no issue,
> both sides will slow down.
> 
> > Here too we may want to enforce some transitional SW-only
> > mode to avoid the (highly unlikely?) case that NIC will decrypt
> > successfully a packet with the old key, even tho new key should be used.  
> 
> Not a crypto expert, but I don't think that's a realistic thing to
> worry about.
> 
> If the NIC uses the old key to decrypt a record that was encrypted
> with the new key and fails, we end up seeing the encrypted record and
> decrypting it in SW, right? (and then we can pick the correct key in
> the SW fallback) We don't just tear down the connection?

Right, in the common case it should just work. We're basically up
against the probability of a hash collision on the auth tag (match 
with both old and new key).

> > Carrying "key ID" with the skb is probably an overkill.  
> 
> I think we can retrieve it from the sequence of records. When we see a
> KeyUpdate, we advance the "key ID" after we're done with the message.
> 
> 
> This makes me wonder again if we should have fake offloads on veth
> (still calling the kernel's crypto library to simulate a device doing
> the encryption and/or decryption), to make it easy to play with the
> software bits, without requiring actual hardware that can offload
> TLS/IPsec/MACsec. But maybe it's too complex to be useful and we'd
> just waste our time fixing bugs in the fake offload rather than
> improving the stack.

It should be quite useful. I also usually just hack up veth, but 
I reckon adding support to netdevsim would be a better fit.
We just need a way to tell two netdevsim ports to "connect to each
other".

> > > One option might be to do the key derivation in the kernel following
> > > section 7.2 of the RFC [1]. I don't know how happy crypto/security
> > > people would be with that. We'd have to introduce new crypto_info
> > > structs, and new cipher types (or a flag in the upper bits of the
> > > cipher type) to go with them. Then the kernel processes incoming key
> > > update messages on its own, and emits its own key update messages when
> > > its current key is expiring. On transmit we also need to inject a
> > > Finished message before the KeyUpdate [2]. That's bringing a lot of
> > > TLS logic in the kernel. At that point we might as well do the whole
> > > handshake... but I really hope it doesn't come to that.  
> > 
> > I think it's mostly a device vs host state sharing problem, so TLS ULP
> > or user space - not a big difference, both are on the host.  
> 
> Maybe. But that's one more implementation of TLS (or large parts of
> it) that admins and security teams have to worry about. Maintained by
> networking people. I really want to avoid going down that route.
> 
> 
> I'll try to dig more into what you're proposing and to poke some holes
> into it over the next few days.
Sabrina Dubroca Feb. 23, 2023, 4:27 p.m. UTC | #8
2023-02-21, 19:19:44 -0800, Jakub Kicinski wrote:
> Sorry for the delay, long weekend + merge window.

No worries, I wasn't expecting much activity on this from you during
the merge window.

> On Thu, 16 Feb 2023 17:23:19 +0100 Sabrina Dubroca wrote:
> > 2023-02-15, 19:57:48 -0800, Jakub Kicinski wrote:
> > > I think we could try to switch to SW crypto on Tx until all data using
> > > old key is ACK'ed, drivers can look at skb->decrypted to skip touching
> > > the transitional skbs. Then remove old key, install new one, resume
> > > offload.  
> > 
> > "all data using the old key" needs to be one list of record per old
> > key, since we can have multiple rekeys.
> 
> No fully parsing this bit.

We can have multiple rekeys in the time it takes to get an ACK for the
first KeyUpdate message to be ACK'ed. I'm not sure why I talked about
a "list of records".

But we could have this sequence of records:

  recN(k1,hwenc)
  KeyUpdate(k1,hwenc)
  // switch to k2 and sw crypto

  rec0(k2,swenc)
  rec1(k2,swenc)
  KeyUpdate(k2,swenc)
  rec0(k3,swenc)
  // receive ACK for KU1, don't enable HW offload for k2 or k3 because we've already switched off k2

  rec1(k3,swenc)
  // receive ACK for KU2, now enable HW offload for k3

  rec2(k3,hwenc)

So we'll need to record the most recent TX rekey, and wait until the
corresponding KU record is ACK'ed, before we resume offload using the
most recent key (and skip possible intermediate keys).

Installing the key in HW and re-enabling the offload will need to
happen via the icsk_clean_acked callback. We'll need a workqueue so
that we don't actually talk to the driver from softirq.

Then, we have to handle a failure to install the key. Since we're not
installing it in HW immediately during setsockopt, notifying userspace
of a rekey failure is more complicated. Maybe we can do a
rekey_prepare during the setsocktopt, and then the actual rekey is an
operation that cannot fail?

> > Could we install the new key in HW a bit earlier? Keep the old key as
> > SW fallback for rtx, but the driver installs the new key when the
> > corresponding KeyUpdate record has gone through and tells the stack to
> > stop doing SW crypto? I'm not sure that'd be a significant improvement
> > in the standard case, though.
> 
> Important consideration is making the non-rekey path as fast as
> possible (given rekeying is extremely rare). Looking at skb->decrypted
> should be very fast but we can possibly fit some other indication of
> "are we rekeying" into another already referenced cache line.
> We definitely don't want to have to look up the record to know what
> state we're in.
> 
> The fallback can't use AES-NI (it's in sirq context) so it's slower 
> than SW encrypt before queuing to TCP. Hence my first thought is using
> SW crypto for new key and let the traffic we already queued with old
> key drain leveraging HW crypto. But as I said the impact on performance
> when not rekeying is more important, and so is driver simplicity.

Right, sorry, full tls_sw path and not the existing fallback.

Changing the socket ops back and forth between the HW and SW variants
worries me, because we only lock the socket once we have entered
tls_{device,sw}_sendmsg. So I think we have to stay on the _device ops
even during the SW crypto phase of the rekey, and let that call into
the SW variant after locking the socket and making sure we're in a
rekey.

> > > We may need special care to make sure we don't try to encrypt the same
> > > packet with both keys. In case a rtx gets stuck somewhere and comes to
> > > the NIC after it's already acked (happens surprisingly often).  
> > 
> > Don't we have that already? If there's a retransmit while we're
> > setting the TX key in HW, data that was queued on the socket before
> > (and shouldn't be encrypted at all) would also be encrypted
> > otherwise. Or is it different with rekey?
> 
> We have a "start marker" record which is supposed to indicate that
> anything before it has already been encrypted. The driver is programmed
> with the start seq no, when it sees a packet from before this seq no
> it checks if a record exists, finds its before the start marker and
> sends the data as is.

Yes, I was looking into that earlier this week. I think we could reuse
a similar mechanism for rekeying. tls_dev_add takes tcp_sk->write_seq,
we could have a tls_dev_rekey op passing the new key and new write_seq
to the driver. I think we can also reuse the ->eor trick from
tls_set_device_offload, and we wouldn't have to look at
skb->decrypted. Close and push the current SW record, mark ->eor, pass
write_seq to the driver along with the key. Also pretty close to what
tls_device_resync_tx does.

[...]
> > This makes me wonder again if we should have fake offloads on veth
> > (still calling the kernel's crypto library to simulate a device doing
> > the encryption and/or decryption), to make it easy to play with the
> > software bits, without requiring actual hardware that can offload
> > TLS/IPsec/MACsec. But maybe it's too complex to be useful and we'd
> > just waste our time fixing bugs in the fake offload rather than
> > improving the stack.
> 
> It should be quite useful. I also usually just hack up veth, but 
> I reckon adding support to netdevsim would be a better fit.
> We just need a way to tell two netdevsim ports to "connect to each
> other".

Oh, nice idea. I'll add that to my todo list.
Jakub Kicinski Feb. 23, 2023, 5:29 p.m. UTC | #9
On Thu, 23 Feb 2023 17:27:40 +0100 Sabrina Dubroca wrote:
> 2023-02-21, 19:19:44 -0800, Jakub Kicinski wrote:
> > > "all data using the old key" needs to be one list of record per old
> > > key, since we can have multiple rekeys.  
> > 
> > No fully parsing this bit.  
> 
> We can have multiple rekeys in the time it takes to get an ACK for the
> first KeyUpdate message to be ACK'ed. I'm not sure why I talked about
> a "list of records".
> 
> But we could have this sequence of records:
> 
>   recN(k1,hwenc)
>   KeyUpdate(k1,hwenc)
>   // switch to k2 and sw crypto
> 
>   rec0(k2,swenc)
>   rec1(k2,swenc)
>   KeyUpdate(k2,swenc)
>   rec0(k3,swenc)
>   // receive ACK for KU1, don't enable HW offload for k2 or k3 because we've already switched off k2
> 
>   rec1(k3,swenc)
>   // receive ACK for KU2, now enable HW offload for k3
> 
>   rec2(k3,hwenc)
> 
> So we'll need to record the most recent TX rekey, and wait until the
> corresponding KU record is ACK'ed, before we resume offload using the
> most recent key (and skip possible intermediate keys).
> 
> Installing the key in HW and re-enabling the offload will need to
> happen via the icsk_clean_acked callback. We'll need a workqueue so
> that we don't actually talk to the driver from softirq.

Installing from icsk_clean_acked won't win us anything, right?
We'll only need the key once the next sendmsg() comes, what's
pushed to TCP with swenc is already out of our hands.

> Then, we have to handle a failure to install the key. Since we're not
> installing it in HW immediately during setsockopt, notifying userspace
> of a rekey failure is more complicated. Maybe we can do a
> rekey_prepare during the setsocktopt, and then the actual rekey is an
> operation that cannot fail?

TLS offload silently falls back to SW on any errors. So that's fine.
Just bump a counter. User/infra must be tracking error counters in 
our current design already.

> > Important consideration is making the non-rekey path as fast as
> > possible (given rekeying is extremely rare). Looking at skb->decrypted
> > should be very fast but we can possibly fit some other indication of
> > "are we rekeying" into another already referenced cache line.
> > We definitely don't want to have to look up the record to know what
> > state we're in.
> > 
> > The fallback can't use AES-NI (it's in sirq context) so it's slower 
> > than SW encrypt before queuing to TCP. Hence my first thought is using
> > SW crypto for new key and let the traffic we already queued with old
> > key drain leveraging HW crypto. But as I said the impact on performance
> > when not rekeying is more important, and so is driver simplicity.  
> 
> Right, sorry, full tls_sw path and not the existing fallback.
> 
> Changing the socket ops back and forth between the HW and SW variants
> worries me, because we only lock the socket once we have entered
> tls_{device,sw}_sendmsg. So I think we have to stay on the _device ops
> even during the SW crypto phase of the rekey, and let that call into
> the SW variant after locking the socket and making sure we're in a
> rekey.

Fair point :S

> > > Don't we have that already? If there's a retransmit while we're
> > > setting the TX key in HW, data that was queued on the socket before
> > > (and shouldn't be encrypted at all) would also be encrypted
> > > otherwise. Or is it different with rekey?  
> > 
> > We have a "start marker" record which is supposed to indicate that
> > anything before it has already been encrypted. The driver is programmed
> > with the start seq no, when it sees a packet from before this seq no
> > it checks if a record exists, finds its before the start marker and
> > sends the data as is.  
> 
> Yes, I was looking into that earlier this week. I think we could reuse
> a similar mechanism for rekeying. tls_dev_add takes tcp_sk->write_seq,
> we could have a tls_dev_rekey op passing the new key and new write_seq
> to the driver. I think we can also reuse the ->eor trick from
> tls_set_device_offload, and we wouldn't have to look at
> skb->decrypted. Close and push the current SW record, mark ->eor, pass
> write_seq to the driver along with the key. Also pretty close to what
> tls_device_resync_tx does.

That sounds like you'd expose the rekeying logic to the drivers?
New op, having to track seq#...
Sabrina Dubroca March 13, 2023, 3:41 p.m. UTC | #10
2023-02-23, 09:29:45 -0800, Jakub Kicinski wrote:
> On Thu, 23 Feb 2023 17:27:40 +0100 Sabrina Dubroca wrote:
> > Installing the key in HW and re-enabling the offload will need to
> > happen via the icsk_clean_acked callback. We'll need a workqueue so
> > that we don't actually talk to the driver from softirq.
> 
> Installing from icsk_clean_acked won't win us anything, right?
> We'll only need the key once the next sendmsg() comes, what's
> pushed to TCP with swenc is already out of our hands.

Avoiding an unpredictable slowdown on the sendmsg() call? We can deal
with that later if it turns out to be an issue. I simply didn't think
of deferring to the next sendmsg().

> > Then, we have to handle a failure to install the key. Since we're not
> > installing it in HW immediately during setsockopt, notifying userspace
> > of a rekey failure is more complicated. Maybe we can do a
> > rekey_prepare during the setsocktopt, and then the actual rekey is an
> > operation that cannot fail?
> 
> TLS offload silently falls back to SW on any errors. So that's fine.
> Just bump a counter. User/infra must be tracking error counters in 
> our current design already.

True. User might be a bit surprised by "well it was offloaded and now
it's not", but ok.

> > > Important consideration is making the non-rekey path as fast as
> > > possible (given rekeying is extremely rare). Looking at skb->decrypted
> > > should be very fast but we can possibly fit some other indication of
> > > "are we rekeying" into another already referenced cache line.
> > > We definitely don't want to have to look up the record to know what
> > > state we're in.
> > > 
> > > The fallback can't use AES-NI (it's in sirq context) so it's slower 
> > > than SW encrypt before queuing to TCP. Hence my first thought is using
> > > SW crypto for new key and let the traffic we already queued with old
> > > key drain leveraging HW crypto. But as I said the impact on performance
> > > when not rekeying is more important, and so is driver simplicity.  
> > 
> > Right, sorry, full tls_sw path and not the existing fallback.
> > 
> > Changing the socket ops back and forth between the HW and SW variants
> > worries me, because we only lock the socket once we have entered
> > tls_{device,sw}_sendmsg. So I think we have to stay on the _device ops
> > even during the SW crypto phase of the rekey, and let that call into
> > the SW variant after locking the socket and making sure we're in a
> > rekey.
> 
> Fair point :S
> 
> > > > Don't we have that already? If there's a retransmit while we're
> > > > setting the TX key in HW, data that was queued on the socket before
> > > > (and shouldn't be encrypted at all) would also be encrypted
> > > > otherwise. Or is it different with rekey?  
> > > 
> > > We have a "start marker" record which is supposed to indicate that
> > > anything before it has already been encrypted. The driver is programmed
> > > with the start seq no, when it sees a packet from before this seq no
> > > it checks if a record exists, finds its before the start marker and
> > > sends the data as is.  
> > 
> > Yes, I was looking into that earlier this week. I think we could reuse
> > a similar mechanism for rekeying. tls_dev_add takes tcp_sk->write_seq,
> > we could have a tls_dev_rekey op passing the new key and new write_seq
> > to the driver. I think we can also reuse the ->eor trick from
> > tls_set_device_offload, and we wouldn't have to look at
> > skb->decrypted. Close and push the current SW record, mark ->eor, pass
> > write_seq to the driver along with the key. Also pretty close to what
> > tls_device_resync_tx does.
> 
> That sounds like you'd expose the rekeying logic to the drivers?
> New op, having to track seq#...

Well, we have to call into the drivers to install the key, whether
that's a new rekey op, or adding an update argument to ->tls_dev_add,
or letting the driver guess that it's a rekey (or ignore that and just
install the key if rekey vs initial key isn't a meaningful
distinction).

We already feed drivers the seq# with ->tls_dev_add, so passing it for
rekeys as well is not a big change.

Does that seem problematic? Adding a rekey op seemed more natural to
me than simply using the existing _del + _add ops, but maybe we can
get away with just using those two ops.
Jakub Kicinski March 13, 2023, 6:35 p.m. UTC | #11
On Mon, 13 Mar 2023 16:41:36 +0100 Sabrina Dubroca wrote:
> > > Yes, I was looking into that earlier this week. I think we could reuse
> > > a similar mechanism for rekeying. tls_dev_add takes tcp_sk->write_seq,
> > > we could have a tls_dev_rekey op passing the new key and new write_seq
> > > to the driver. I think we can also reuse the ->eor trick from
> > > tls_set_device_offload, and we wouldn't have to look at
> > > skb->decrypted. Close and push the current SW record, mark ->eor, pass
> > > write_seq to the driver along with the key. Also pretty close to what
> > > tls_device_resync_tx does.  
> > 
> > That sounds like you'd expose the rekeying logic to the drivers?
> > New op, having to track seq#...  
> 
> Well, we have to call into the drivers to install the key, whether
> that's a new rekey op, or adding an update argument to ->tls_dev_add,
> or letting the driver guess that it's a rekey (or ignore that and just
> install the key if rekey vs initial key isn't a meaningful
> distinction).
> 
> We already feed drivers the seq# with ->tls_dev_add, so passing it for
> rekeys as well is not a big change.
> 
> Does that seem problematic? Adding a rekey op seemed more natural to
> me than simply using the existing _del + _add ops, but maybe we can
> get away with just using those two ops.

Theoretically a rekey op is nicer and cleaner. Practically the quality
of the driver implementations will vary wildly*, and it's a significant
time investment to review all of them. So for non-technical reasons my
intuition is that we'd deliver a better overall user experience if we
handled the rekey entirely in the core.

Wait for old key to no longer be needed, _del + _add, start using the
offload again.

* One vendor submitted a driver claiming support for TLS 1.3, when 
  TLS 1.3 offload was rejected by the core. So this is the level of
  testing and diligence we're working with :(
Sabrina Dubroca March 22, 2023, 4:10 p.m. UTC | #12
2023-03-13, 11:35:10 -0700, Jakub Kicinski wrote:
> On Mon, 13 Mar 2023 16:41:36 +0100 Sabrina Dubroca wrote:
> > > > Yes, I was looking into that earlier this week. I think we could reuse
> > > > a similar mechanism for rekeying. tls_dev_add takes tcp_sk->write_seq,
> > > > we could have a tls_dev_rekey op passing the new key and new write_seq
> > > > to the driver. I think we can also reuse the ->eor trick from
> > > > tls_set_device_offload, and we wouldn't have to look at
> > > > skb->decrypted. Close and push the current SW record, mark ->eor, pass
> > > > write_seq to the driver along with the key. Also pretty close to what
> > > > tls_device_resync_tx does.  
> > > 
> > > That sounds like you'd expose the rekeying logic to the drivers?
> > > New op, having to track seq#...  
> > 
> > Well, we have to call into the drivers to install the key, whether
> > that's a new rekey op, or adding an update argument to ->tls_dev_add,
> > or letting the driver guess that it's a rekey (or ignore that and just
> > install the key if rekey vs initial key isn't a meaningful
> > distinction).
> > 
> > We already feed drivers the seq# with ->tls_dev_add, so passing it for
> > rekeys as well is not a big change.
> > 
> > Does that seem problematic? Adding a rekey op seemed more natural to
> > me than simply using the existing _del + _add ops, but maybe we can
> > get away with just using those two ops.
> 
> Theoretically a rekey op is nicer and cleaner. Practically the quality
> of the driver implementations will vary wildly*, and it's a significant
> time investment to review all of them. So for non-technical reasons my
> intuition is that we'd deliver a better overall user experience if we
> handled the rekey entirely in the core.
> 
> Wait for old key to no longer be needed, _del + _add, start using the
> offload again.
> 
> * One vendor submitted a driver claiming support for TLS 1.3, when 
>   TLS 1.3 offload was rejected by the core. So this is the level of
>   testing and diligence we're working with :(

:(

Ok, _del + _add then.


I went over the thread to summarize what we've come up with so far:

RX
 - The existing SW path will handle all records between the KeyUpdate
   message signaling the change of key and the new key becoming known
   to the kernel -- those will be queued encrypted, and decrypted in
   SW as they are read by userspace (once the key is provided, ie same
   as this patchset)
 - Call ->tls_dev_del + ->tls_dev_add immediately during
   setsockopt(TLS_RX)

TX
 - After setsockopt(TLS_TX), switch to the existing SW path (not the
   current device_fallback) until we're able to re-enable HW offload
   - tls_device_{sendmsg,sendpage} will call into
     tls_sw_{sendmsg,sendpage} under lock_sock to avoid changing
     socket ops during the rekey while another thread might be waiting
     on the lock
 - We only re-enable HW offload (call ->tls_dev_add to install the new
   key in HW) once all records sent with the old key have been
   ACKed. At this point, all unacked records are SW-encrypted with the
   new key, and the old key is unused by both HW and retransmissions.
   - If there are no unacked records when userspace does
     setsockopt(TLS_TX), we can (try to) install the new key in HW
     immediately.
   - If yet another key has been provided via setsockopt(TLS_TX), we
     don't install intermediate keys, only the latest.
   - TCP notifies ktls of ACKs via the icsk_clean_acked callback. In
     case of a rekey, tls_icsk_clean_acked will record when all data
     sent with the most recent past key has been sent. The next call
     to sendmsg/sendpage will install the new key in HW.
   - We close and push the current SW record before reenabling
     offload.

If ->tls_dev_add fails to install the new key in HW, we stay in SW
mode. We can add a counter to keep track of this.


In addition:

Because we can't change socket ops during a rekey, we'll also have to
modify do_tls_setsockopt_conf to check ctx->tx_conf and only call
either tls_set_device_offload or tls_set_sw_offload. RX already uses
the same ops for both TLS_HW and TLS_SW, so we could switch between HW
and SW mode on rekey.

An alternative would be to have a common sendmsg/sendpage which locks
the socket and then calls the correct implementation. We'll need that
anyway for the offload under rekey case, so that would only add a test
to the SW path's ops (compared to the current code). That should allow
us to make build_protos a lot simpler.
Jakub Kicinski March 22, 2023, 7:43 p.m. UTC | #13
On Wed, 22 Mar 2023 17:10:25 +0100 Sabrina Dubroca wrote:
> > Theoretically a rekey op is nicer and cleaner. Practically the quality
> > of the driver implementations will vary wildly*, and it's a significant
> > time investment to review all of them. So for non-technical reasons my
> > intuition is that we'd deliver a better overall user experience if we
> > handled the rekey entirely in the core.
> > 
> > Wait for old key to no longer be needed, _del + _add, start using the
> > offload again.
> > 
> > * One vendor submitted a driver claiming support for TLS 1.3, when 
> >   TLS 1.3 offload was rejected by the core. So this is the level of
> >   testing and diligence we're working with :(  
> 
> :(
> 
> Ok, _del + _add then.
> 
> 
> I went over the thread to summarize what we've come up with so far:
> 
> RX
>  - The existing SW path will handle all records between the KeyUpdate
>    message signaling the change of key and the new key becoming known
>    to the kernel -- those will be queued encrypted, and decrypted in
>    SW as they are read by userspace (once the key is provided, ie same
>    as this patchset)
>  - Call ->tls_dev_del + ->tls_dev_add immediately during
>    setsockopt(TLS_RX)
> 
> TX
>  - After setsockopt(TLS_TX), switch to the existing SW path (not the
>    current device_fallback) until we're able to re-enable HW offload
>    - tls_device_{sendmsg,sendpage} will call into
>      tls_sw_{sendmsg,sendpage} under lock_sock to avoid changing
>      socket ops during the rekey while another thread might be waiting
>      on the lock
>  - We only re-enable HW offload (call ->tls_dev_add to install the new
>    key in HW) once all records sent with the old key have been
>    ACKed. At this point, all unacked records are SW-encrypted with the
>    new key, and the old key is unused by both HW and retransmissions.
>    - If there are no unacked records when userspace does
>      setsockopt(TLS_TX), we can (try to) install the new key in HW
>      immediately.
>    - If yet another key has been provided via setsockopt(TLS_TX), we
>      don't install intermediate keys, only the latest.
>    - TCP notifies ktls of ACKs via the icsk_clean_acked callback. In
>      case of a rekey, tls_icsk_clean_acked will record when all data
>      sent with the most recent past key has been sent. The next call
>      to sendmsg/sendpage will install the new key in HW.
>    - We close and push the current SW record before reenabling
>      offload.
> 
> If ->tls_dev_add fails to install the new key in HW, we stay in SW
> mode. We can add a counter to keep track of this.

SG!

> In addition:
> 
> Because we can't change socket ops during a rekey, we'll also have to
> modify do_tls_setsockopt_conf to check ctx->tx_conf and only call
> either tls_set_device_offload or tls_set_sw_offload. RX already uses
> the same ops for both TLS_HW and TLS_SW, so we could switch between HW
> and SW mode on rekey.
> 
> An alternative would be to have a common sendmsg/sendpage which locks
> the socket and then calls the correct implementation. We'll need that
> anyway for the offload under rekey case, so that would only add a test
> to the SW path's ops (compared to the current code). That should allow
> us to make build_protos a lot simpler.

No preference assuming perf is the same.