[RFC,4/5] net/tls: Add support for PF_TLSH (a TLS handshake listener)

Message ID 165030059051.5073.16723746870370826608.stgit@oracle-102.nfsv4.dev (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series Implement a TLS handshake upcall

Checks

Context Check Description
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 6606 this patch: 6606
netdev/cc_maintainers warning 10 maintainers not CCed: jk@codeconstruct.com.au davem@davemloft.net corbet@lwn.net pabeni@redhat.com daniel@iogearbox.net linux-doc@vger.kernel.org changbin.du@intel.com john.fastabend@gmail.com kuba@kernel.org edumazet@google.com
netdev/build_clang success Errors and warnings before: 1729 this patch: 1729
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 12085 this patch: 12085
netdev/checkpatch warning CHECK: Comparison to NULL could be written "ctx"; CHECK: Please don't use multiple blank lines; CHECK: Unnecessary parentheses around 'len < sizeof(key_serial_t)'; CHECK: extern prototypes should be avoided in .h files; WARNING: Missing or malformed SPDX-License-Identifier tag in line 1; WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?; WARNING: braces {} are not necessary for any arm of this statement; WARNING: line length of 81 exceeds 80 columns; WARNING: line length of 82 exceeds 80 columns; WARNING: line length of 83 exceeds 80 columns; WARNING: line length of 84 exceeds 80 columns; WARNING: networking block comments don't use an empty /* line, use /* Comment...
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/tree_selection success Guessing tree name failed - patch did not apply, async

Commit Message

Chuck Lever April 18, 2022, 4:49 p.m. UTC
In-kernel TLS consumers need a way to perform a TLS handshake. In
the absence of a handshake implementation in the kernel itself, a
mechanism to perform the handshake in user space, using an existing
TLS handshake library, is necessary.

I've designed a way to pass a connected kernel socket endpoint to
user space using the traditional listen/accept mechanism. accept(2)
gives us a well-understood way to materialize a socket endpoint as a
normal file descriptor in a specific user space process. Like any
open socket descriptor, the accepted FD can then be passed to a
library such as OpenSSL to perform a TLS handshake.

This prototype currently handles only initiating client-side TLS
handshakes. Server-side handshakes and key renegotiation are left
to do.

Security Considerations
~~~~~~~~ ~~~~~~~~~~~~~~

This prototype is net-namespace aware.

The kernel has no mechanism to attest that the listening user space
agent is trustworthy.

Currently the prototype does not handle multiple listeners that
overlap -- multiple listeners in the same net namespace that have
overlapping bind addresses.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 .../networking/tls-in-kernel-handshake.rst         |  103 ++
 include/linux/socket.h                             |    1 
 include/net/sock.h                                 |    3 
 include/net/tls.h                                  |   15 
 include/net/tlsh.h                                 |   22 
 include/uapi/linux/tls.h                           |   16 
 net/core/sock.c                                    |    2 
 net/tls/Makefile                                   |    2 
 net/tls/af_tlsh.c                                  | 1040 ++++++++++++++++++++
 net/tls/tls_main.c                                 |   10 
 10 files changed, 1213 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/networking/tls-in-kernel-handshake.rst
 create mode 100644 include/net/tlsh.h
 create mode 100644 net/tls/af_tlsh.c
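
The listen/accept handoff described in the commit message can be sketched
from the agent's side roughly as follows. This is only an illustration under
stated assumptions: the new address family is not available in released
kernel headers, so the sketch substitutes AF_UNIX (abstract namespace) as a
stand-in, and every function name below is hypothetical. With the real
family, accept(2) would return the kernel consumer's already-connected
socket rather than a fresh local connection.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Stand-in address family so the sketch runs on an unpatched kernel.
 * A real agent would use the family added by this patch instead.
 */
#define SKETCH_FAMILY AF_UNIX

/* Fill in an abstract-namespace AF_UNIX address (Linux-specific). */
static socklen_t sketch_addr(struct sockaddr_un *addr, const char *name)
{
	size_t n = strlen(name);

	assert(n > 0 && n < sizeof(addr->sun_path) - 1);
	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	memcpy(addr->sun_path + 1, name, n);	/* leading NUL: abstract name */
	return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + n);
}

/* Agent side: create the listening rendezvous point. */
static int agent_listen(const char *name)
{
	struct sockaddr_un addr;
	socklen_t len = sketch_addr(&addr, name);
	int fd = socket(SKETCH_FAMILY, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (bind(fd, (struct sockaddr *)&addr, len) < 0 || listen(fd, 16) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Agent side: accept(2) materializes the endpoint as an ordinary
 * descriptor, which could then be handed to a TLS library
 * (for example via OpenSSL's SSL_set_fd()).
 */
static int agent_accept(int listen_fd)
{
	return accept(listen_fd, NULL, NULL);
}

/* Stand-in for the kernel consumer that requests a handshake. */
static int consumer_connect(const char *name)
{
	struct sockaddr_un addr;
	socklen_t len = sketch_addr(&addr, name);
	int fd = socket(SKETCH_FAMILY, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (connect(fd, (struct sockaddr *)&addr, len) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

An agent built this way would call agent_listen() once and then loop on
agent_accept(), performing one handshake per accepted descriptor.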

Comments

Hannes Reinecke April 21, 2022, 7:36 a.m. UTC | #1
On 4/18/22 18:49, Chuck Lever wrote:
> In-kernel TLS consumers need a way to perform a TLS handshake. In
> the absence of a handshake implementation in the kernel itself, a
> mechanism to perform the handshake in user space, using an existing
> TLS handshake library, is necessary.
> 
> I've designed a way to pass a connected kernel socket endpoint to
> user space using the traditional listen/accept mechanism. accept(2)
> gives us a well-understood way to materialize a socket endpoint as a
> normal file descriptor in a specific user space process. Like any
> open socket descriptor, the accepted FD can then be passed to a
> library such as openSSL to perform a TLS handshake.
> 
> This prototype currently handles only initiating client-side TLS
> handshakes. Server-side handshakes and key renegotiation are left
> to do.
> 
> Security Considerations
> ~~~~~~~~ ~~~~~~~~~~~~~~
> 
> This prototype is net-namespace aware.
> 
> The kernel has no mechanism to attest that the listening user space
> agent is trustworthy.
> 
> Currently the prototype does not handle multiple listeners that
> overlap -- multiple listeners in the same net namespace that have
> overlapping bind addresses.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>   .../networking/tls-in-kernel-handshake.rst         |  103 ++
>   include/linux/socket.h                             |    1
>   include/net/sock.h                                 |    3
>   include/net/tls.h                                  |   15
>   include/net/tlsh.h                                 |   22
>   include/uapi/linux/tls.h                           |   16
>   net/core/sock.c                                    |    2
>   net/tls/Makefile                                   |    2
>   net/tls/af_tlsh.c                                  | 1040 ++++++++++++++++++++
>   net/tls/tls_main.c                                 |   10
>   10 files changed, 1213 insertions(+), 1 deletion(-)
>   create mode 100644 Documentation/networking/tls-in-kernel-handshake.rst
>   create mode 100644 include/net/tlsh.h
>   create mode 100644 net/tls/af_tlsh.c
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes

Jakub Kicinski April 25, 2022, 5:14 p.m. UTC | #2
On Mon, 18 Apr 2022 12:49:50 -0400 Chuck Lever wrote:
> In-kernel TLS consumers need a way to perform a TLS handshake. In
> the absence of a handshake implementation in the kernel itself, a
> mechanism to perform the handshake in user space, using an existing
> TLS handshake library, is necessary.
> 
> I've designed a way to pass a connected kernel socket endpoint to
> user space using the traditional listen/accept mechanism. accept(2)
> gives us a well-understood way to materialize a socket endpoint as a
> normal file descriptor in a specific user space process. Like any
> open socket descriptor, the accepted FD can then be passed to a
> library such as openSSL to perform a TLS handshake.
> 
> This prototype currently handles only initiating client-side TLS
> handshakes. Server-side handshakes and key renegotiation are left
> to do.
> 
> Security Considerations
> ~~~~~~~~ ~~~~~~~~~~~~~~
> 
> This prototype is net-namespace aware.
> 
> The kernel has no mechanism to attest that the listening user space
> agent is trustworthy.
> 
> Currently the prototype does not handle multiple listeners that
> overlap -- multiple listeners in the same net namespace that have
> overlapping bind addresses.

Create the socket in user space, do all the handshakes you need there
and then pass it to the kernel.  This is how NBD + TLS works.  Scales
better and requires much less kernel code.

Hannes Reinecke April 26, 2022, 9:43 a.m. UTC | #3
On 4/25/22 19:14, Jakub Kicinski wrote:
> On Mon, 18 Apr 2022 12:49:50 -0400 Chuck Lever wrote:
>> In-kernel TLS consumers need a way to perform a TLS handshake. In
>> the absence of a handshake implementation in the kernel itself, a
>> mechanism to perform the handshake in user space, using an existing
>> TLS handshake library, is necessary.
>>
>> I've designed a way to pass a connected kernel socket endpoint to
>> user space using the traditional listen/accept mechanism. accept(2)
>> gives us a well-understood way to materialize a socket endpoint as a
>> normal file descriptor in a specific user space process. Like any
>> open socket descriptor, the accepted FD can then be passed to a
>> library such as openSSL to perform a TLS handshake.
>>
>> This prototype currently handles only initiating client-side TLS
>> handshakes. Server-side handshakes and key renegotiation are left
>> to do.
>>
>> Security Considerations
>> ~~~~~~~~ ~~~~~~~~~~~~~~
>>
>> This prototype is net-namespace aware.
>>
>> The kernel has no mechanism to attest that the listening user space
>> agent is trustworthy.
>>
>> Currently the prototype does not handle multiple listeners that
>> overlap -- multiple listeners in the same net namespace that have
>> overlapping bind addresses.
> 
> Create the socket in user space, do all the handshakes you need there
> and then pass it to the kernel.  This is how NBD + TLS works.  Scales
> better and requires much less kernel code.
> 
But we can't, as the existing mechanisms (at least for NVMe) create the
socket in-kernel.
Having to create the socket in userspace would require a completely new 
interface for nvme and will not be backwards compatible.
Not to mention having to rework the nvme driver to accept sockets from 
userspace instead of creating them internally.

With this approach we can keep existing infrastructure, and can get a 
common implementation for either transport.

Cheers,

Hannes

Chuck Lever April 26, 2022, 1:48 p.m. UTC | #4
Hi Jakub-

> On Apr 25, 2022, at 1:14 PM, Jakub Kicinski <kuba@kernel.org> wrote:
> 
> On Mon, 18 Apr 2022 12:49:50 -0400 Chuck Lever wrote:
>> In-kernel TLS consumers need a way to perform a TLS handshake. In
>> the absence of a handshake implementation in the kernel itself, a
>> mechanism to perform the handshake in user space, using an existing
>> TLS handshake library, is necessary.
>> 
>> I've designed a way to pass a connected kernel socket endpoint to
>> user space using the traditional listen/accept mechanism. accept(2)
>> gives us a well-understood way to materialize a socket endpoint as a
>> normal file descriptor in a specific user space process. Like any
>> open socket descriptor, the accepted FD can then be passed to a
>> library such as openSSL to perform a TLS handshake.
>> 
>> This prototype currently handles only initiating client-side TLS
>> handshakes. Server-side handshakes and key renegotiation are left
>> to do.
>> 
>> Security Considerations
>> ~~~~~~~~ ~~~~~~~~~~~~~~
>> 
>> This prototype is net-namespace aware.
>> 
>> The kernel has no mechanism to attest that the listening user space
>> agent is trustworthy.
>> 
>> Currently the prototype does not handle multiple listeners that
>> overlap -- multiple listeners in the same net namespace that have
>> overlapping bind addresses.
> 
> Create the socket in user space, do all the handshakes you need there
> and then pass it to the kernel.  This is how NBD + TLS works.  Scales
> better and requires much less kernel code.

The RPC-with-TLS standard allows unencrypted RPC traffic on the connection
before sending ClientHello. I think we'd like to stick with creating the
socket in the kernel, for this reason and for the reasons Hannes mentions
in his reply.

--
Chuck Lever

Sagi Grimberg April 26, 2022, 2:29 p.m. UTC | #5
>>> Currently the prototype does not handle multiple listeners that
>>> overlap -- multiple listeners in the same net namespace that have
>>> overlapping bind addresses.
>>
>> Create the socket in user space, do all the handshakes you need there
>> and then pass it to the kernel.  This is how NBD + TLS works.  Scales
>> better and requires much less kernel code.
>>
> But we can't, as the existing mechanisms (at least for NVMe) creates the 
> socket in-kernel.
> Having to create the socket in userspace would require a completely new 
> interface for nvme and will not be backwards compatible.

And we will still need the upcall anyways when we reconnect 
(re-establish the socket)

Jakub Kicinski April 26, 2022, 2:55 p.m. UTC | #6
On Tue, 26 Apr 2022 11:43:37 +0200 Hannes Reinecke wrote:
> > Create the socket in user space, do all the handshakes you need there
> > and then pass it to the kernel.  This is how NBD + TLS works.  Scales
> > better and requires much less kernel code.
> >   
> But we can't, as the existing mechanisms (at least for NVMe) creates the 
> socket in-kernel.
> Having to create the socket in userspace would require a completely new 
> interface for nvme and will not be backwards compatible.
> Not to mention having to rework the nvme driver to accept sockets from 
> userspace instead of creating them internally.
> 
> With this approach we can keep existing infrastructure, and can get a 
> common implementation for either transport.

You add 1.5kLoC and require running a user space agent, surely you're
adding new interfaces and are not backward-compatible already.

I don't understand your argument, maybe you could rephrase / dumb it
down for me?

Jakub Kicinski April 26, 2022, 2:55 p.m. UTC | #7
On Tue, 26 Apr 2022 13:48:20 +0000 Chuck Lever III wrote:
> > Create the socket in user space, do all the handshakes you need there
> > and then pass it to the kernel.  This is how NBD + TLS works.  Scales
> > better and requires much less kernel code.  
> 
> The RPC-with-TLS standard allows unencrypted RPC traffic on the connection
> before sending ClientHello. I think we'd like to stick with creating the
> socket in the kernel, for this reason and for the reasons Hannes mentions
> in his reply.

Umpf, I presume that's reviewed by security people in IETF so I guess
it's done right this time (tm).

Your wording seems careful not to imply that you actually need that,
tho. Am I over-interpreting?

Jakub Kicinski April 26, 2022, 3:02 p.m. UTC | #8
On Tue, 26 Apr 2022 17:29:03 +0300 Sagi Grimberg wrote:
> >> Create the socket in user space, do all the handshakes you need there
> >> and then pass it to the kernel.  This is how NBD + TLS works.  Scales
> >> better and requires much less kernel code.
> >>  
> > But we can't, as the existing mechanisms (at least for NVMe) creates the 
> > socket in-kernel.
> > Having to create the socket in userspace would require a completely new 
> > interface for nvme and will not be backwards compatible.  
> 
> And we will still need the upcall anyways when we reconnect 
> (re-establish the socket)

That totally flew over my head, I have zero familiarity with in-kernel
storage network users :S

In all honesty the tls code in the kernel is a bit of a dumping ground.
People come, dump a bunch of code and disappear. Nobody seems to care
that the result is still (years in) not ready for production use :/
Until a month ago it'd break connections even under moderate memory
pressure. This set does not even have selftests.

Plus there are more protocols being actively worked on (QUIC, PSP etc.)
Having per ULP special sauce to invoke a user space helper is not the
paradigm we chose, and the time is as inopportune as ever to change that.

Chuck Lever April 26, 2022, 3:58 p.m. UTC | #9
> On Apr 26, 2022, at 10:55 AM, Jakub Kicinski <kuba@kernel.org> wrote:
> 
> On Tue, 26 Apr 2022 13:48:20 +0000 Chuck Lever III wrote:
>>> Create the socket in user space, do all the handshakes you need there
>>> and then pass it to the kernel.  This is how NBD + TLS works.  Scales
>>> better and requires much less kernel code.  
>> 
>> The RPC-with-TLS standard allows unencrypted RPC traffic on the connection
>> before sending ClientHello. I think we'd like to stick with creating the
>> socket in the kernel, for this reason and for the reasons Hannes mentions
>> in his reply.
> 
> Umpf, I presume that's reviewed by security people in IETF so I guess
> it's done right this time (tm).

> Your wording seems careful not to imply that you actually need that,
> tho. Am I over-interpreting?

RPC-with-TLS requires one RPC as a "starttls" token. That could be
done in user space as part of the handshake, but it is currently
done in the kernel to enable the user agent to be shared with other
kernel consumers of TLS. Keep in mind that we already have two
real consumers: NVMe and RPC-with-TLS; and possibly QUIC.
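
To make the "starttls" token concrete, here is a rough sketch of how such a
probe might be encoded on a stream transport. Nothing in it comes from the
patch itself. It assumes the RPC-with-TLS specification's convention of a
NULL procedure call whose credential flavor is AUTH_TLS (value 7) with an
empty body, plus standard ONC RPC record marking. The function names are
hypothetical, and the values should be checked against the specification
before being relied on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values assumed from the ONC RPC and RPC-with-TLS specifications. */
enum {
	RPC_MSG_CALL = 0,
	RPC_PROTO_VERSION = 2,
	RPC_PROC_NULL = 0,
	RPC_AUTH_NONE = 0,
	RPC_AUTH_TLS = 7,
};

#define STARTTLS_PROBE_LEN 44	/* 4-byte record marker + ten XDR words */

/* Store one big-endian 32-bit XDR word and return the next offset. */
static size_t put_u32(uint8_t *buf, size_t off, uint32_t v)
{
	buf[off + 0] = (uint8_t)(v >> 24);
	buf[off + 1] = (uint8_t)(v >> 16);
	buf[off + 2] = (uint8_t)(v >> 8);
	buf[off + 3] = (uint8_t)v;
	return off + 4;
}

/*
 * Encode a NULL call carrying an empty AUTH_TLS credential.
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
static size_t encode_starttls_probe(uint8_t *buf, size_t buflen,
				    uint32_t xid, uint32_t prog, uint32_t vers)
{
	size_t off = 4;		/* leave room for the record marker */

	assert(buf != NULL);
	if (buflen < STARTTLS_PROBE_LEN)
		return 0;
	off = put_u32(buf, off, xid);
	off = put_u32(buf, off, RPC_MSG_CALL);
	off = put_u32(buf, off, RPC_PROTO_VERSION);
	off = put_u32(buf, off, prog);
	off = put_u32(buf, off, vers);
	off = put_u32(buf, off, RPC_PROC_NULL);
	off = put_u32(buf, off, RPC_AUTH_TLS);	/* credential flavor */
	off = put_u32(buf, off, 0);		/* credential length */
	off = put_u32(buf, off, RPC_AUTH_NONE);	/* verifier flavor */
	off = put_u32(buf, off, 0);		/* verifier length */
	/* Record marker: last-fragment bit plus the fragment length. */
	put_u32(buf, 0, 0x80000000u | (uint32_t)(off - 4));
	return off;
}
```

Whether this probe is sent by the kernel or by the user agent is exactly the
design choice under discussion; the encoding is the same either way.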

You asserted earlier that creating sockets in user space "scales
better" but did not provide any data. Can we see some? How well
does it need to scale for storage protocols that use long-lived
connections?

Also, why has no-one mentioned the NBD on TLS implementation to
us before? I will try to review that code soon.


> This set does not even have selftests.

I can include unit tests with the prototype. Someone needs to
educate me on what is the preferred unit test paradigm for this
type of subsystem. Examples in the current kernel code base would
help too.


> Plus there are more protocols being actively worked on (QUIC, PSP etc.)
> Having per ULP special sauce to invoke a user space helper is not the
> paradigm we chose, and the time as inopportune as ever to change that.

When we started discussing TLS handshake requirements with some
community members several years ago, creating the socket in
kernel and passing it up to a user agent was the suggested design.
Has that recommendation changed since then?

I'd prefer an in-kernel handshake implementation over a user
space one (even one that is sharable amongst transports and ULPs
as my proposal is intended to be). However, so far we've been told
that an in-kernel handshake implementation is a non-starter.

But in the abstract, we agree that having a single TLS handshake
mechanism for kernel consumers is preferable.


--
Chuck Lever

Hannes Reinecke April 26, 2022, 3:58 p.m. UTC | #10
On 4/26/22 17:02, Jakub Kicinski wrote:
> On Tue, 26 Apr 2022 17:29:03 +0300 Sagi Grimberg wrote:
>>>> Create the socket in user space, do all the handshakes you need there
>>>> and then pass it to the kernel.  This is how NBD + TLS works.  Scales
>>>> better and requires much less kernel code.
>>>>   
>>> But we can't, as the existing mechanisms (at least for NVMe) creates the
>>> socket in-kernel.
>>> Having to create the socket in userspace would require a completely new
>>> interface for nvme and will not be backwards compatible.
>>
>> And we will still need the upcall anyways when we reconnect
>> (re-establish the socket)
> 
> That totally flew over my head, I have zero familiarity with in-kernel
> storage network users :S
> 
Count yourself lucky.

> In all honesty the tls code in the kernel is a bit of a dumping ground.
> People come, dump a bunch of code and disappear. Nobody seems to care
> that the result is still (years in) not ready for production use :/
> Until a month ago it'd break connections even under moderate memory
> pressure. This set does not even have selftests.
> 
Well, I'd been surprised that it worked, too.
And even more so that Boris Pismenny @ Nvidia is actively working on it.
(Thanks, Sagi!)

> Plus there are more protocols being actively worked on (QUIC, PSP etc.)
> Having per ULP special sauce to invoke a user space helper is not the
> paradigm we chose, and the time as inopportune as ever to change that.

Which is precisely what we hope to discuss at LSF.
(Yes, I know, probably not the best venue to discuss network stuff ...)

Each approach has its drawbacks:

- Establishing sockets from userspace will cause issues during
reconnection, as then someone (aka the kernel) will have to inform
userspace that a new connection needs to be established.
(And that has to happen while the root filesystem is potentially
inaccessible, so you can't just call arbitrary commands here.
In particular, call_usermodehelper() is out of the question.)
- Having ULP helpers (as with this design) mitigates that problem
somewhat in the sense that you can mlock() that daemon and have it
poll on an intermediate socket; that solves the notification problem.
But you have to have ULP special sauce here to make it work.
- Moving everything into the kernel is ... possible. But then you have yet
another security-relevant piece of code in the kernel which needs to be
audited, CVEd etc. Not to mention the usual policy discussion whether it
really belongs in the kernel.

So I don't really see any obvious way to go; best we can do is to pick 
the least ugly :-(
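
The second option above (a memory-locked daemon polling a socket) might look
roughly like this in the agent. It is a minimal sketch, not code from the
series. The function names are hypothetical, and it uses mlockall(2) as one
plausible way to keep the daemon resident while the root filesystem is
unreachable.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Pin current and future pages so the agent can keep running without
 * paging in from storage. This may fail without sufficient privilege
 * or RLIMIT_MEMLOCK; the caller decides whether that is fatal.
 */
static int agent_pin_memory(void)
{
	return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/*
 * Wait until the notification descriptor becomes readable.
 * Returns 1 if readable, 0 on timeout, -1 on error or hangup.
 */
static int agent_wait(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc;

	assert(fd >= 0);
	do {
		rc = poll(&pfd, 1, timeout_ms);
	} while (rc < 0 && errno == EINTR);

	if (rc <= 0)
		return rc;
	return (pfd.revents & POLLIN) ? 1 : -1;
}
```

A daemon would typically call agent_pin_memory() once at startup and then
loop on agent_wait() with the listening descriptor, so no new process has to
be spawned when a reconnect needs a handshake.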

Cheers,

Hannes

Jakub Kicinski April 26, 2022, 11:47 p.m. UTC | #11
On Tue, 26 Apr 2022 15:58:29 +0000 Chuck Lever III wrote:
> > On Apr 26, 2022, at 10:55 AM, Jakub Kicinski <kuba@kernel.org> wrote:
> >> The RPC-with-TLS standard allows unencrypted RPC traffic on the connection
> >> before sending ClientHello. I think we'd like to stick with creating the
> >> socket in the kernel, for this reason and for the reasons Hannes mentions
> >> in his reply.  
> > 
> > Umpf, I presume that's reviewed by security people in IETF so I guess
> > it's done right this time (tm).  
> 
> > Your wording seems careful not to imply that you actually need that,
> > tho. Am I over-interpreting?  
> 
> RPC-with-TLS requires one RPC as a "starttls" token. That could be
> done in user space as part of the handshake, but it is currently
> done in the kernel to enable the user agent to be shared with other
> kernel consumers of TLS. Keep in mind that we already have two
> real consumers: NVMe and RPC-with-TLS; and possibly QUIC.
> 
> You asserted earlier that creating sockets in user space "scales
> better" but did not provide any data. Can we see some? How well
> does it need to scale for storage protocols that use long-lived
> connections?

I meant scale with the number of possible crypto protocols, 
I mentioned three there.

> Also, why has no-one mentioned the NBD on TLS implementation to
> us before? I will try to review that code soon.

Oops, maybe that thing had never seen the light of a public mailing
list then :S Dave Watson was working on it at Facebook, but he has since
moved to greener pastures.

> > This set does not even have selftests.  
> 
> I can include unit tests with the prototype. Someone needs to
> educate me on what is the preferred unit test paradigm for this
> type of subsystem. Examples in the current kernel code base would
> help too.

Whatever level of testing makes you as an engineer comfortable
with saying "this test suite is sufficient"? ;)

For TLS we have tools/testing/selftests/net/tls.c - it's hardly
an example of excellence but, you know, it catches bugs here and 
there.

> > Plus there are more protocols being actively worked on (QUIC, PSP etc.)
> > Having per ULP special sauce to invoke a user space helper is not the
> > paradigm we chose, and the time as inopportune as ever to change that.  
> 
> When we started discussing TLS handshake requirements with some
> community members several years ago, creating the socket in
> kernel and passing it up to a user agent was the suggested design.
> Has that recommendation changed since then?

Hm, do you remember who you discussed it with? Would be good 
to loop those folks in. I wasn't involved at the beginning of the 
TLS work, I know second hand that HW offload and nbd were involved 
and that the design went thru some serious re-architecting along 
the way. In the beginning there was a separate socket for control
records, and that was nacked.

But also (and perhaps most importantly) I'm not really objecting 
to creating the socket in the kernel. I'm primarily objecting to 
a second type of a special TLS socket which has TLS semantics.

> I'd prefer an in-kernel handshake implementation over a user
> space one (even one that is sharable amongst transports and ULPs
> as my proposal is intended to be). However, so far we've been told
> that an in-kernel handshake implementation is a non-starter.
> 
> But in the abstract, we agree that having a single TLS handshake
> mechanism for kernel consumers is preferable.

For some definition of "we" which doesn't not include me?

Jakub Kicinski April 27, 2022, 12:03 a.m. UTC | #12
On Tue, 26 Apr 2022 17:58:39 +0200 Hannes Reinecke wrote:
> > Plus there are more protocols being actively worked on (QUIC, PSP etc.)
> > Having per ULP special sauce to invoke a user space helper is not the
> > paradigm we chose, and the time as inopportune as ever to change that.  
> 
> Which is precisely what we hope to discuss at LSF.
> (Yes, I know, probably not the best venue to discuss network stuff ...)

Indeed.

> Each approach has its drawbacks:
> 
> - Establishing sockets from userspace will cause issues during 
> reconnection, as then someone (aka the kernel) will have to inform 
> userspace that a new connection will need to be established.
> (And that has to happen while the root filesystem is potentially 
> inaccessible, so you can't just call arbitrary commands here)
> (Especially call_usermodehelper() is out of the game)

Indeed, we may need _some_ form of a notification mechanism and that's
okay. Can be a (more generic) socket, can be something based on existing
network storage APIs (IDK what you have there).

My thinking was that establishing the session in user space would be
easiest. We wouldn't need all the special getsockopt()s which AFAIU
work around part of the handshake being done in the kernel, and which,
I hope we can agree, are not beautiful.

> - Having ULP helpers (as with this design) mitigates that problem 
> somewhat in the sense that you can mlock() that daemon and having it 
> polling on an intermediate socket; that solves the notification problem.
> But you have to have ULP special sauce here to make it work.

TBH I don't see how this is much different to option 1 in terms of
constraints & requirements on the user space agent. We can implement
option 1 over a socket-like interface, too, and that'll carry
notifications all the same.

> - Moving everything in kernel is ... possible. But then you have yet 
> another security-relevant piece of code in the kernel which needs to be 
> audited, CVEd etc. Not to mention the usual policy discussion whether it 
> really belongs into the kernel.

Yeah, if that gets posted it'd be great if it includes removing me from
the TLS maintainers 'cause I want to sleep at night ;)

> So I don't really see any obvious way to go; best we can do is to pick 
> the least ugly :-(

True, I'm sure we can find some middle ground between 1 and 2.
Preferably implemented in a way where the mechanism is separated 
from the fact it's carrying TLS handshake requests, so that it can
carry something else tomorrow.

Chuck Lever April 27, 2022, 2:42 p.m. UTC | #13
> On Apr 26, 2022, at 7:47 PM, Jakub Kicinski <kuba@kernel.org> wrote:
> 
> On Tue, 26 Apr 2022 15:58:29 +0000 Chuck Lever III wrote:
>>> On Apr 26, 2022, at 10:55 AM, Jakub Kicinski <kuba@kernel.org> wrote:
>>>> The RPC-with-TLS standard allows unencrypted RPC traffic on the connection
>>>> before sending ClientHello. I think we'd like to stick with creating the
>>>> socket in the kernel, for this reason and for the reasons Hannes mentions
>>>> in his reply.  
>>> 
>>> Umpf, I presume that's reviewed by security people in IETF so I guess
>>> it's done right this time (tm).  
>> 
>>> Your wording seems careful not to imply that you actually need that,
>>> tho. Am I over-interpreting?  
>> 
>> RPC-with-TLS requires one RPC as a "starttls" token. That could be
>> done in user space as part of the handshake, but it is currently
>> done in the kernel to enable the user agent to be shared with other
>> kernel consumers of TLS. Keep in mind that we already have two
>> real consumers: NVMe and RPC-with-TLS; and possibly QUIC.
>> 
>> You asserted earlier that creating sockets in user space "scales
>> better" but did not provide any data. Can we see some? How well
>> does it need to scale for storage protocols that use long-lived
>> connections?
> 
> I meant scale with the number of possible crypto protocols, 
> I mentioned three there.

I'm looking at previous emails. The "three crypto protocols"
don't stand out to me. Which ones?

The prototype has a "handshake type" option that enables the kernel
to request handshakes for different transport layer security
protocols. Is that the kind of scalability you mean?

For TLS, we expect to have at least:

 - ClientHello
  - X509
  - PSK
 - ServerHello
 - Re-key

It should be straightforward to add the ability to service
other handshake types.
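
As an illustration of how an agent could service several handshake types,
here is a small dispatch-table sketch. The enum names and handlers are
hypothetical and do not come from the prototype. The handlers are
placeholders that only report success, and the server-side and re-key
entries are left unimplemented to mirror the current client-only state of
the prototype.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical identifiers; the prototype's constants may differ. */
enum hs_type {
	HS_CLIENTHELLO_X509,
	HS_CLIENTHELLO_PSK,
	HS_SERVERHELLO,
	HS_REKEY,
	HS_TYPE_COUNT
};

typedef int (*hs_handler)(int sock_fd);

/* Placeholder: a real handler would run an x.509 handshake here. */
static int do_clienthello_x509(int sock_fd)
{
	(void)sock_fd;
	return 0;
}

/* Placeholder: a real handler would run a PSK handshake here. */
static int do_clienthello_psk(int sock_fd)
{
	(void)sock_fd;
	return 0;
}

/* Entries without a handler stay NULL and are reported as unsupported. */
static const hs_handler hs_handlers[HS_TYPE_COUNT] = {
	[HS_CLIENTHELLO_X509] = do_clienthello_x509,
	[HS_CLIENTHELLO_PSK] = do_clienthello_psk,
};

/* Route a handshake request to its handler by type. */
static int hs_dispatch(int type, int sock_fd)
{
	assert(sock_fd >= 0);
	if (type < 0 || type >= HS_TYPE_COUNT)
		return -EINVAL;
	if (hs_handlers[type] == NULL)
		return -EOPNOTSUPP;
	return hs_handlers[type](sock_fd);
}
```

Adding a new handshake type in this scheme means adding one enum value and
one table entry, which is the kind of extensibility described above.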


>> Also, why has no-one mentioned the NBD on TLS implementation to
>> us before? I will try to review that code soon.
> 
> Oops, maybe that thing had never seen the light of a public mailing
> list then :S Dave Watson was working on it at Facebook, but he since
> moved to greener pastures.
> 
>>> This set does not even have selftests.  
>> 
>> I can include unit tests with the prototype. Someone needs to
>> educate me on what is the preferred unit test paradigm for this
>> type of subsystem. Examples in the current kernel code base would
>> help too.
> 
> Whatever level of testing makes you as an engineer comfortable
> with saying "this test suite is sufficient"? ;)
> 
> For TLS we have tools/testing/selftests/net/tls.c - it's hardly
> an example of excellence but, you know, it catches bugs here and 
> there.

My question wasn't clear, sorry. I meant, what framework is
appropriate to use for unit tests in this area?


>>> Plus there are more protocols being actively worked on (QUIC, PSP etc.)
>>> Having per ULP special sauce to invoke a user space helper is not the
>>> paradigm we chose, and the time as inopportune as ever to change that.  
>> 
>> When we started discussing TLS handshake requirements with some
>> community members several years ago, creating the socket in
>> kernel and passing it up to a user agent was the suggested design.
>> Has that recommendation changed since then?
> 
> Hm, do you remember who you discussed it with? Would be good 
> to loop those folks in.

Yes, I remember. Trond Myklebust discussed this with Dave Miller
during a hallway conversation at a conference (probably Plumbers)
in 2018 or 2019.

Trond is Cc'd on this thread via linux-nfs@ and Dave is Cc'd via
netdev@.

I also traded email with Boris Pismenny about this a year ago,
and if memory serves he also recommended passing an existing
socket up to user space. He is Cc'd on this directly.


> I wasn't involved at the beginning of the 
> TLS work, I know second hand that HW offload and nbd were involved 
> and that the design went thru some serious re-architecting along 
> the way. In the beginning there was a separate socket for control
> records, and that was nacked.
> 
> But also (and perhaps most importantly) I'm not really objecting 
> to creating the socket in the kernel. I'm primarily objecting to 
> a second type of a special TLS socket which has TLS semantics.

I don't understand your objection. Can you clarify?

AF_TLSH is a listen-only socket. It's just a rendezvous point
for passing a kernel socket up to user space. It doesn't have
any particular "TLS semantics". It's the user space agent
listening on that endpoint that implements particular handshake
behaviors.

In fact, if the name AF_TLSH gives you hives, that can be made
more generic. However, that makes it harder for the kernel to
figure out which listening endpoint handles handshake requests.


>> I'd prefer an in-kernel handshake implementation over a user
>> space one (even one that is sharable amongst transports and ULPs
>> as my proposal is intended to be). However, so far we've been told
>> that an in-kernel handshake implementation is a non-starter.
>> 
>> But in the abstract, we agree that having a single TLS handshake
>> mechanism for kernel consumers is preferable.
> 
> For some definition of "we" which doesn't not include me?

The double negative made me blink a couple of times.

I'm working with folks from the Linux NFS community, the
Linux block community, and the Linux SMB community. We
would be happy to include you in our effort, if you would
like to be more involved.


--
Chuck Lever
Chuck Lever April 27, 2022, 3:24 p.m. UTC | #14
> On Apr 26, 2022, at 8:03 PM, Jakub Kicinski <kuba@kernel.org> wrote:
> 
> On Tue, 26 Apr 2022 17:58:39 +0200 Hannes Reinecke wrote:
>> 
>> - Establishing sockets from userspace will cause issues during 
>> reconnection, as then someone (aka the kernel) will have to inform 
>> userspace that a new connection will need to be established.
>> (And that has to happen while the root filesystem is potentially 
>> inaccessible, so you can't just call arbitrary commands here)
>> (Especially call_usermodehelper() is out of the game)
> 
> Indeed, we may need _some_ form of a notification mechanism and that's
> okay. Can be a (more generic) socket, can be something based on existing
> network storage APIs (IDK what you have there).
> 
> My thinking was that establishing the session in user space would be
> easiest. We wouldn't need all the special getsockopt()s which AFAIU
> work around part of the handshake being done in the kernel, and which,
> I hope we can agree, are not beautiful.

In the prototype, the new socket options on AF_TLSH sockets
include:

#define TLSH_PRIORITIES        1       /* Retrieve TLS priorities string */
#define TLSH_PEERID            2       /* Retrieve peer identity */
#define TLSH_HANDSHAKE_TYPE    3       /* Retrieve handshake type */
#define TLSH_X509_CERTIFICATE  4       /* Retrieve x.509 certificate */

PRIORITIES is the TLS priorities string that the GnuTLS library
uses to parametrize the handshake (which TLS versions, ciphers,
and so on).

PEERID is a keyring serial number for the key that contains
a Pre-Shared Key (for PSK handshakes) or the private key (for
x.509 handshakes).

HANDSHAKE_TYPE is an integer that represents the type of handshake
being requested: ClientHello, ServerHello, Rekey, and so on. This
option enables the repertoire of handshake types to be expanded.

X509_CERTIFICATE is a keyring serial number for the key that
contains an x.509 certificate.

When each handshake is complete, the handshake agent instantiates
the IV into the passed-in socket using existing kTLS socket options
before it returns the endpoint to the kernel.
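
For reference, a sketch of the existing kTLS socket options the agent
would use for the transmit direction, assuming TLSv1.3 with
AES-GCM-128. The function name is hypothetical; the key, IV, salt, and
record sequence number are whatever the TLS library negotiated. The
receive direction would be set up the same way with TLS_RX.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/tls.h>

/* Older libc headers may lack these; the values match the kernel's. */
#ifndef SOL_TLS
#define SOL_TLS 282
#endif
#ifndef TCP_ULP
#define TCP_ULP 31
#endif

/*
 * Hypothetical helper: attach the "tls" ULP to a connected TCP socket
 * and install the transmit key material. Returns 0 on success, or -1
 * with errno set.
 */
int ktls_install_tx(int fd, const unsigned char *key,
		    const unsigned char *iv, const unsigned char *salt,
		    const unsigned char *rec_seq)
{
	struct tls12_crypto_info_aes_gcm_128 ci;

	assert(key && iv && salt && rec_seq);

	/* Attach the TLS upper layer protocol to the socket. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
		return -1;

	memset(&ci, 0, sizeof(ci));
	ci.info.version = TLS_1_3_VERSION;
	ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
	memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
	memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
	memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
	memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

	/* Hand the session state to the kernel's record layer. */
	return setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));
}
```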

There is nothing in these options that indicates to the handshake
agent what upper layer protocol is going to be used inside the TLS
session.

----

The new AF_TLSH socket options are not there because the handshake
is split between the kernel and user space. They are there because
the initial requester is (e.g., in the case of NFS) mount.nfs, another
user space program. mount.nfs has to package up an x.509 cert or
pre-shared key and place it on a keyring to make it available to
the handshake agent.

The basic issue is that the administrative interfaces that
parametrize the handshakes are quite distant from the in-kernel
consumers that make handshake requests.

----

Further, in order to support server side TLS handshakes in the
kernel, we really do have to pass a kernel-created socket up to
user space. NFSD (and maybe the NVMe target) use in-kernel listeners
to accept incoming connections. Each new endpoint is created in the
kernel.

So if you truly seek generality in this facility, the user
space componentry must work with passed-in sockets rather than
creating them in user space.

--
Chuck Lever
Jakub Kicinski April 27, 2022, 11:53 p.m. UTC | #15
On Wed, 27 Apr 2022 14:42:53 +0000 Chuck Lever III wrote:
> > On Apr 26, 2022, at 7:47 PM, Jakub Kicinski <kuba@kernel.org> wrote:
> >> RPC-with-TLS requires one RPC as a "starttls" token. That could be
> >> done in user space as part of the handshake, but it is currently
> >> done in the kernel to enable the user agent to be shared with other
> >> kernel consumers of TLS. Keep in mind that we already have two
> >> real consumers: NVMe and RPC-with-TLS; and possibly QUIC.
> >> 
> >> You asserted earlier that creating sockets in user space "scales
> >> better" but did not provide any data. Can we see some? How well
> >> does it need to scale for storage protocols that use long-lived
> >> connections?  
> > 
> > I meant scale with the number of possible crypto protocols, 
> > I mentioned three there.  
> 
> I'm looking at previous emails. The "three crypto protocols"
> don't stand out to me. Which ones?

TLS, QUIC and PSP. Maybe that was in a different email than what you
quoted, sorry:
https://lore.kernel.org/all/20220426080247.19bbb64e@kernel.org/

PSP:
https://raw.githubusercontent.com/google/psp/main/doc/PSP_Arch_Spec.pdf

> The prototype has a "handshake type" option that enables the kernel
> to request handshakes for different transport layer security
> protocols. Is that the kind of scalability you mean?
> 
> For TLS, we expect to have at least:
> 
>  - ClientHello
>   - X509
>   - PSK
>  - ServerHello
>  - Re-key
> 
> It should be straightforward to add the ability to service
> other handshake types.
> 
> >> I can include unit tests with the prototype. Someone needs to
> >> educate me on what is the preferred unit test paradigm for this
> >> type of subsystem. Examples in the current kernel code base would
> >> help too.  
> > 
> > Whatever level of testing makes you as an engineer comfortable
> > with saying "this test suite is sufficient"? ;)
> > 
> > For TLS we have tools/testing/selftests/net/tls.c - it's hardly
> > an example of excellence but, you know, it catches bugs here and 
> > there.  
> 
> My question wasn't clear, sorry. I meant, what framework is
> appropriate to use for unit tests in this area?

Nothing area specific, tools/testing/selftests/kselftest_harness.h
is what the tls test uses.

> >> When we started discussing TLS handshake requirements with some
> >> community members several years ago, creating the socket in
> >> kernel and passing it up to a user agent was the suggested design.
> >> Has that recommendation changed since then?  
> > 
> > Hm, do you remember who you discussed it with? Would be good 
> > to loop those folks in.  
> 
> Yes, I remember. Trond Myklebust discussed this with Dave Miller
> during a hallway conversation at a conference (probably Plumbers)
> in 2018 or 2019.
> 
> Trond is Cc'd on this thread via linux-nfs@ and Dave is Cc'd via
> netdev@.
> 
> I also traded email with Boris Pismenny about this a year ago,
> and if memory serves he also recommended passing an existing
> socket up to user space. He is Cc'd on this directly.

I see.

> > I wasn't involved at the beginning of the 
> > TLS work, I know second hand that HW offload and nbd were involved 
> > and that the design went thru some serious re-architecting along 
> > the way. In the beginning there was a separate socket for control
> > records, and that was nacked.
> > 
> > But also (and perhaps most importantly) I'm not really objecting 
> > to creating the socket in the kernel. I'm primarily objecting to 
> > a second type of a special TLS socket which has TLS semantics.  
> 
> I don't understand your objection. Can you clarify?
> 
> AF_TLSH is a listen-only socket. It's just a rendezvous point
> for passing a kernel socket up to user space. It doesn't have
> any particular "TLS semantics". It's the user space agent
> listening on that endpoint that implements particular handshake
> behaviors.
> 
> In fact, if the name AF_TLSH gives you hives, that can be made
> more generic.

Yes, a more generic "user space please bake my socket" interface 
is what I'm leaning towards.

> However, that makes it harder for the kernel to
> figure out which listening endpoint handles handshake requests.

Right, the listening endpoint...

Is it possible to instead create a fd-passing-like structured message
which could carry the fd and all the relevant context (what goes 
via the getsockopt() now)? 

The user space agent can open such upcall socket, then bind to
whatever entity it wants to talk to on the kernel side and read
the notifications via recv()?
Chuck Lever April 28, 2022, 1:29 a.m. UTC | #16
> On Apr 27, 2022, at 7:53 PM, Jakub Kicinski <kuba@kernel.org> wrote:
> 
> On Wed, 27 Apr 2022 14:42:53 +0000 Chuck Lever III wrote:
>>> On Apr 26, 2022, at 7:47 PM, Jakub Kicinski <kuba@kernel.org> wrote:
>>>> RPC-with-TLS requires one RPC as a "starttls" token. That could be
>>>> done in user space as part of the handshake, but it is currently
>>>> done in the kernel to enable the user agent to be shared with other
>>>> kernel consumers of TLS. Keep in mind that we already have two
>>>> real consumers: NVMe and RPC-with-TLS; and possibly QUIC.
>>>> 
>>>> You asserted earlier that creating sockets in user space "scales
>>>> better" but did not provide any data. Can we see some? How well
>>>> does it need to scale for storage protocols that use long-lived
>>>> connections?  
>>> 
>>> I meant scale with the number of possible crypto protocols, 
>>> I mentioned three there.  
>> 
>> I'm looking at previous emails. The "three crypto protocols"
>> don't stand out to me. Which ones?
> 
> TLS, QUIC and PSP. Maybe that was in a different email than what you
> quoted, sorry:
> https://lore.kernel.org/all/20220426080247.19bbb64e@kernel.org/
> 
> PSP:
> https://raw.githubusercontent.com/google/psp/main/doc/PSP_Arch_Spec.pdf

During the design process, we discussed both TLS and QUIC handshake
requirements, which are nearly the same. QUIC will want a TLSv1.3
handshake on a UDP socket, effectively. We can support DTLS in a
similar fashion.

We hope that the proposed design can be used for all of those, and
barring anything unforeseen in the description of PSP you provided,
PSP can be supported as well.

The handshake agent is really only a shim around a TLS library.
There isn't much to it.


> Is it possible to instead create a fd-passing-like structured message
> which could carry the fd and all the relevant context (what goes 
> via the getsockopt() now)?
> 
> The user space agent can open such upcall socket, then bind to
> whatever entity it wants to talk to on the kernel side and read
> the notifications via recv()?

We considered this kind of design. A reasonable place to start there
would be to fabricate new NETLINK messages to do this. I don't see
much benefit over what is done now, it's just a different isomer of
syntactic sugar, but it could be considered.

The issue is how the connected socket is materialized in user space.
accept(2) is the historical way to instantiate an already connected
socket in a process's file table, and seems like a natural fit. When
the handshake agent is done with the handshake, it closes the socket.
This invokes the tlsh_release() function which can check whether the
IV implantation was successful.
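
The agent side of that accept/close life cycle can be sketched
roughly as follows. Since AF_TLSH exists only in the prototype, an
AF_UNIX listener stands in for the rendezvous socket here, and
stub_handshake() stands in for the TLS library call; every name in
this sketch is hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

typedef int (*handshake_fn)(int fd);

/* Stand-in listener: AF_UNIX with Linux autobind (abstract address). */
int make_stub_listener(void)
{
	struct sockaddr_un addr;
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	/* Passing only the family asks Linux to autobind the socket. */
	if (bind(fd, (struct sockaddr *)&addr, sizeof(sa_family_t)) < 0 ||
	    listen(fd, 1) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Stand-in for the TLS library handshake: send one byte to the peer. */
int stub_handshake(int fd)
{
	return send(fd, "K", 1, 0) == 1 ? 0 : -1;
}

/*
 * Accept one connected endpoint, run the handshake on it, then close
 * it. In the design described above, the close is what tells the
 * kernel that the agent is finished with the socket.
 */
int agent_serve_one(int listen_fd, handshake_fn do_handshake)
{
	int fd, ret;

	assert(do_handshake != NULL);

	fd = accept(listen_fd, NULL, NULL);
	if (fd < 0)
		return -1;

	ret = do_handshake(fd);
	close(fd);
	return ret;
}
```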

So instead of an AF_TLSH listener we could use a named pipe or a
netlink socket and a blocking recv(), as long as there is a reasonable
solution to how a connected socket fd is attached to the handshake
agent process.

I'm flexible about the mechanism for passing handshake parameters.
Attaching them to the connected socket seems convenient, but perhaps
not aesthetic.


--
Chuck Lever
Hannes Reinecke April 28, 2022, 7:26 a.m. UTC | #17
On 4/27/22 02:03, Jakub Kicinski wrote:
> On Tue, 26 Apr 2022 17:58:39 +0200 Hannes Reinecke wrote:
>>> Plus there are more protocols being actively worked on (QUIC, PSP etc.)
>>> Having per ULP special sauce to invoke a user space helper is not the
>>> paradigm we chose, and the time is as inopportune as ever to change that.
>>
>> Which is precisely what we hope to discuss at LSF.
>> (Yes, I know, probably not the best venue to discuss network stuff ...)
> 
> Indeed.
> 
>> Each approach has its drawbacks:
>>
>> - Establishing sockets from userspace will cause issues during
>> reconnection, as then someone (aka the kernel) will have to inform
>> userspace that a new connection will need to be established.
>> (And that has to happen while the root filesystem is potentially
>> inaccessible, so you can't just call arbitrary commands here)
>> (Especially call_usermodehelper() is out of the game)
> 
> Indeed, we may need _some_ form of a notification mechanism and that's
> okay. Can be a (more generic) socket, can be something based on existing
> network storage APIs (IDK what you have there).
> 
Which is one of the issues; we don't have any.
Network storage is accessed from userspace via read()/write(), and for 
control everyone rolls their own thing.
So speaking of a 'network storage API' is quite optimistic.

> My thinking was that establishing the session in user space would be
> easiest. We wouldn't need all the special getsockopt()s which AFAIU
> work around part of the handshake being done in the kernel, and which,
> I hope we can agree, are not beautiful.
> 
Well; that is open to debate.
During open-iscsi development (which followed your model of having a 
control- and dataplane split between userspace and kernel) we found that 
the model is great for keeping the in-kernel code simple.
But we also found that it's not so great once you come down to the 
gritty details; you have to duplicate quite some protocol handling in 
userspace, and you have to worry about session re-establishment.
Up to the point where we start wondering if moving things into userspace 
was a good idea at all...

As for this particular interface: the problem we're facing is that TLS
has to be started in-band. For some reason NVMexpress decided not to
follow the traditional method of establishing the socket, starting TLS,
and then starting protocol processing; instead it has specified a hybrid
model: establish the socket, do authentication, _then_ start TLS,
and only then continue with protocol processing on the TLS socket.

Moving _that_ into userspace would require us to move most of the
protocol logic into userspace, too. Which really is something we want to
avoid, as this would mean quite a lot of code duplication.
Not to mention the maintenance burden, as issues would need to be fixed 
in two locations.
(Talk to me about iscsiuio.)

And that's before we even get to things like TLS session tickets, which 
opens yet another can of worms.

>> - Having ULP helpers (as with this design) mitigates that problem
>> somewhat in the sense that you can mlock() that daemon and having it
>> polling on an intermediate socket; that solves the notification problem.
>> But you have to have ULP special sauce here to make it work.
> 
> TBH I don't see how this is much different to option 1 in terms of
> constraints & requirements on the user space agent. We can implement
> option 1 over a socket-like interface, too, and that'll carry
> notifications all the same.
> 
>> - Moving everything in kernel is ... possible. But then you have yet
>> another security-relevant piece of code in the kernel which needs to be
>> audited, CVEd etc. Not to mention the usual policy discussion whether it
>> really belongs into the kernel.
> 
> Yeah, if that gets posted it'd be great if it includes removing me from
> the TLS maintainers 'cause I want to sleep at night ;)
> 
>> So I don't really see any obvious way to go; best we can do is to pick
>> the least ugly :-(
> 
> True, I'm sure we can find some middle ground between 1 and 2.
> Preferably implemented in a way where the mechanism is separated
> from the fact it's carrying TLS handshake requests, so that it can
> carry something else tomorrow.
> 
Which was actually our goal.
The whole thing started off with the problem of _how_ sockets could be
passed between kernel and userspace and vice versa.
While there is fd passing between processes via AF_UNIX, there is no 
such mechanism between kernel and userspace.
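
For comparison, this is roughly what the process-to-process case looks
like with SCM_RIGHTS over AF_UNIX. It is a minimal sketch; the helper
names send_fd() and recv_fd() are made up for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send @fd across the AF_UNIX socket @chan. Returns 0 or -1. */
int send_fd(int chan, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	assert(fd >= 0);

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	/* The descriptor travels as ancillary data, not as payload. */
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(chan, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from @chan. Returns the new fd or -1. */
int recv_fd(int chan)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(chan, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	/* The kernel has already installed the fd in our file table. */
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiving process gets a new descriptor number that refers to the
same open file; that installation step is what has no kernel-to-user
equivalent today.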

So accept() was an easy way out as the implementation was quite simple.
And it was moved into net/tls as this was the primary use-case.
But it's by no means meant to be exclusive for TLS; we could expand it 
to other things once there is a need.
But we really didn't want to over-engineer things here.

However, if you have another idea, by all means, do tell.
It's just that I don't think creating sockets in userspace is a great 
fit for us.

Cheers,

Hannes
Boris Pismenny April 28, 2022, 8:49 a.m. UTC | #18
On 18/04/2022 19:49, Chuck Lever wrote:
> In-kernel TLS consumers need a way to perform a TLS handshake. In
> the absence of a handshake implementation in the kernel itself, a
> mechanism to perform the handshake in user space, using an existing
> TLS handshake library, is necessary.
>
> I've designed a way to pass a connected kernel socket endpoint to
> user space using the traditional listen/accept mechanism. accept(2)
> gives us a well-understood way to materialize a socket endpoint as a
> normal file descriptor in a specific user space process. Like any
> open socket descriptor, the accepted FD can then be passed to a
> library such as openSSL to perform a TLS handshake.
>
> This prototype currently handles only initiating client-side TLS
> handshakes. Server-side handshakes and key renegotiation are left
> to do.
>
> Security Considerations
> ~~~~~~~~ ~~~~~~~~~~~~~~
>
> This prototype is net-namespace aware.
>
> The kernel has no mechanism to attest that the listening user space
> agent is trustworthy.
>
> Currently the prototype does not handle multiple listeners that
> overlap -- multiple listeners in the same net namespace that have
> overlapping bind addresses.
>

Thanks for posting this. As we discussed offline, I think this approach
is more manageable compared to a full in-kernel TLS handshake. A while
ago, I've hacked around TLS to implement the data-path for NVMe-TLS and
the data-path is indeed very simple provided an infrastructure such as
this one.

Making this more generic is desirable, and this obviously requires
supporting multiple listeners for multiple protocols (TLS, DTLS, QUIC,
PSP, etc.), which suggests that it will reside somewhere outside of net/tls.
Moreover, there is a need to support (TLS) control messages here too.
These will occasionally require going back to the userspace daemon
during kernel packet processing. A few examples are handling: TLS rekey,
TLS close_notify, and TLS keepalives. I'm not saying that we need to
support everything from day-1, but there needs to be a way to support these.

A related kernel interface is the XFRM netlink where the kernel asks a
userspace daemon to perform an IKE handshake for establishing IPsec SAs.
This works well when the handshake runs on a different socket, perhaps
that interface can be extended to do handshakes on a given socket that
lives in the kernel without actually passing the fd to userspace. If we
avoid instantiating a full socket fd in userspace, then the need for an
accept(2) interface is reduced, right?
Simo Sorce April 28, 2022, 1:12 p.m. UTC | #19
On Thu, 2022-04-28 at 11:49 +0300, Boris Pismenny wrote:
> On 18/04/2022 19:49, Chuck Lever wrote:
> > In-kernel TLS consumers need a way to perform a TLS handshake. In
> > the absence of a handshake implementation in the kernel itself, a
> > mechanism to perform the handshake in user space, using an existing
> > TLS handshake library, is necessary.
> > 
> > I've designed a way to pass a connected kernel socket endpoint to
> > user space using the traditional listen/accept mechanism. accept(2)
> > gives us a well-understood way to materialize a socket endpoint as a
> > normal file descriptor in a specific user space process. Like any
> > open socket descriptor, the accepted FD can then be passed to a
> > library such as openSSL to perform a TLS handshake.
> > 
> > This prototype currently handles only initiating client-side TLS
> > handshakes. Server-side handshakes and key renegotiation are left
> > to do.
> > 
> > Security Considerations
> > ~~~~~~~~ ~~~~~~~~~~~~~~
> > 
> > This prototype is net-namespace aware.
> > 
> > The kernel has no mechanism to attest that the listening user space
> > agent is trustworthy.
> > 
> > Currently the prototype does not handle multiple listeners that
> > overlap -- multiple listeners in the same net namespace that have
> > overlapping bind addresses.
> > 
> 
> Thanks for posting this. As we discussed offline, I think this approach
> is more manageable compared to a full in-kernel TLS handshake. A while
> ago, I've hacked around TLS to implement the data-path for NVMe-TLS and
> the data-path is indeed very simple provided an infrastructure such as
> this one.
> 
> Making this more generic is desirable, and this obviously requires
> supporting multiple listeners for multiple protocols (TLS, DTLS, QUIC,
> PSP, etc.), which suggests that it will reside somewhere outside of net/tls.
> Moreover, there is a need to support (TLS) control messages here too.
> These will occasionally require going back to the userspace daemon
> during kernel packet processing. A few examples are handling: TLS rekey,
> TLS close_notify, and TLS keepalives. I'm not saying that we need to
> support everything from day-1, but there needs to be a way to support these.
> 
> A related kernel interface is the XFRM netlink where the kernel asks a
> userspace daemon to perform an IKE handshake for establishing IPsec SAs.
> This works well when the handshake runs on a different socket, perhaps
> that interface can be extended to do handshakes on a given socket that
> lives in the kernel without actually passing the fd to userspace. If we
> avoid instantiating a full socket fd in userspace, then the need for an
> accept(2) interface is reduced, right?

JFYI:
For in-kernel NFSD handshakes we also use the gssproxy unix socket in
the kernel, which allows GSSAPI handshakes to be relayed from the
kernel to a user space listening daemon.

The infrastructure is technically already available and could be
reasonably simply extended to do TLS negotiations as well.

Not saying it is the best interface, but it is already available, and
already used by NFS code.

Simo.
Jakub Kicinski April 28, 2022, 1:30 p.m. UTC | #20
On Thu, 28 Apr 2022 09:26:41 +0200 Hannes Reinecke wrote:
> The whole thing started off with the problem on _how_ sockets could be 
> passed between kernel and userspace and vice versa.
> While there is fd passing between processes via AF_UNIX, there is no 
> such mechanism between kernel and userspace.

Noob question - the kernel <> user space FD sharing is just 
not implemented yet, or somehow fundamentally hard because kernel 
fds are "special"?
Hannes Reinecke April 28, 2022, 1:51 p.m. UTC | #21
On 4/28/22 15:30, Jakub Kicinski wrote:
> On Thu, 28 Apr 2022 09:26:41 +0200 Hannes Reinecke wrote:
>> The whole thing started off with the problem on _how_ sockets could be
>> passed between kernel and userspace and vice versa.
>> While there is fd passing between processes via AF_UNIX, there is no
>> such mechanism between kernel and userspace.
> 
> Noob question - the kernel <> user space FD sharing is just
> not implemented yet, or somehow fundamentally hard because kernel
> fds are "special"?

Noob reply: wish I knew.
(I somewhat hoped _you_ would've been able to tell me.)

Thing is, the only method I could think of for fd passing is the POSIX 
fd passing via unix_attach_fds()/unix_detach_fds().
But that's AF_UNIX, which really is designed for process-to-process 
communication, not process-to-kernel.
So you probably have to move a similar logic over to AF_NETLINK. And 
design a new interface on how fds should be passed over AF_NETLINK.

But then you have to face the issue that AF_NETLINK is essentially UDP, 
and you have _no_ idea if and how many processes do listen on the other 
end. Thing is, you (as the sender) have to copy the fd over to the 
receiving process, so you'd better _hope_ there is a receiving process.
Not to mention that there might be several processes listening in...

And that's something I _definitely_ don't feel comfortable with without 
guidance from the networking folks, so I didn't pursue it further and we 
went with the 'accept()' mechanism Chuck implemented.

I'm open to suggestions, though.

Cheers,

Hannes
Benjamin Coddington April 28, 2022, 2:09 p.m. UTC | #22
On 28 Apr 2022, at 9:51, Hannes Reinecke wrote:

> On 4/28/22 15:30, Jakub Kicinski wrote:
>> On Thu, 28 Apr 2022 09:26:41 +0200 Hannes Reinecke wrote:
>>> The whole thing started off with the problem on _how_ sockets could be
>>> passed between kernel and userspace and vice versa.
>>> While there is fd passing between processes via AF_UNIX, there is no
>>> such mechanism between kernel and userspace.
>>
>> Noob question - the kernel <> user space FD sharing is just
>> not implemented yet, or somehow fundamentally hard because kernel
>> fds are "special"?
>
> Noob reply: wish I knew.  (I somewhat hoped _you_ would've been able to
> tell me.)
>
> Thing is, the only method I could think of for fd passing is the POSIX fd
> passing via unix_attach_fds()/unix_detach_fds().  But that's AF_UNIX,
> which really is designed for process-to-process communication, not
> process-to-kernel.  So you probably have to move a similar logic over to
> AF_NETLINK. And design a new interface on how fds should be passed over
> AF_NETLINK.
>
> But then you have to face the issue that AF_NETLINK is essentially UDP, and
> you have _no_ idea if and how many processes do listen on the other end.
> Thing is, you (as the sender) have to copy the fd over to the receiving
> process, so you'd better _hope_ there is a receiving process.  Not to
> mention that there might be several processes listening in...
>
> And that's something I _definitely_ don't feel comfortable with without
> guidance from the networking folks, so I didn't pursue it further and we
> went with the 'accept()' mechanism Chuck implemented.
>
> I'm open to suggestions, though.

EXPORT_SYMBOL(receive_fd) would allow interesting implementations.

The kernel keyring facilities have a good API for creating various key_types
which are able to perform work such as this from userspace contexts.

I have a working prototype for a keyring key instantiation which allows a
userspace process to install a kernel fd on its file table.  The problem
here is how to match/route such fd passing to appropriate processes in
appropriate namespaces.  I think this problem is shared by all
kernel-to-userspace upcalls, which I hope we can discuss at LSF/MM.

I don't think kernel fds are very special as compared to userspace fds.

Ben
Chuck Lever April 28, 2022, 3:24 p.m. UTC | #23
> On Apr 28, 2022, at 4:49 AM, Boris Pismenny <borispismenny@gmail.com> wrote:
> 
> On 18/04/2022 19:49, Chuck Lever wrote:
>> In-kernel TLS consumers need a way to perform a TLS handshake. In
>> the absence of a handshake implementation in the kernel itself, a
>> mechanism to perform the handshake in user space, using an existing
>> TLS handshake library, is necessary.
>> 
>> I've designed a way to pass a connected kernel socket endpoint to
>> user space using the traditional listen/accept mechanism. accept(2)
>> gives us a well-understood way to materialize a socket endpoint as a
>> normal file descriptor in a specific user space process. Like any
>> open socket descriptor, the accepted FD can then be passed to a
>> library such as openSSL to perform a TLS handshake.
>> 
>> This prototype currently handles only initiating client-side TLS
>> handshakes. Server-side handshakes and key renegotiation are left
>> to do.
>> 
>> Security Considerations
>> ~~~~~~~~ ~~~~~~~~~~~~~~
>> 
>> This prototype is net-namespace aware.
>> 
>> The kernel has no mechanism to attest that the listening user space
>> agent is trustworthy.
>> 
>> Currently the prototype does not handle multiple listeners that
>> overlap -- multiple listeners in the same net namespace that have
>> overlapping bind addresses.
>> 
> 
> Thanks for posting this. As we discussed offline, I think this approach
> is more manageable compared to a full in-kernel TLS handshake. A while
> ago, I've hacked around TLS to implement the data-path for NVMe-TLS and
> the data-path is indeed very simple provided an infrastructure such as
> this one.
> 
> Making this more generic is desirable, and this obviously requires
> supporting multiple listeners for multiple protocols (TLS, DTLS, QUIC,
> PSP, etc.), which suggests that it will reside somewhere outside of net/tls.
> Moreover, there is a need to support (TLS) control messages here too.
> These will occasionally require going back to the userspace daemon
> during kernel packet processing. A few examples are handling: TLS rekey,
> TLS close_notify, and TLS keepalives. I'm not saying that we need to
> support everything from day-1, but there needs to be a way to support these.

I agree that control messages need to be handled as well. For the
moment, the prototype simply breaks the connection when a control
message is encountered, and a new session is negotiated. That of
course is not the desired long-term solution.

If we believe that control messages are going to be distinct for
each transport security layer, then perhaps we cannot make the
handshake mechanism generic -- it will have to be specific to
each security layer. Just a thought.


> A related kernel interface is the XFRM netlink where the kernel asks a
> userspace daemon to perform an IKE handshake for establishing IPsec SAs.
> This works well when the handshake runs on a different socket, perhaps
> that interface can be extended to do handshakes on a given socket that
> lives in the kernel without actually passing the fd to userspace. If we
> avoid instantiating a full socket fd in userspace, then the need for an
> accept(2) interface is reduced, right?

Certainly piping the handshake messages up to user space instead
of handing off a socket is possible. The TLS libraries would need
to tolerate this, and GnuTLS (at least) appears OK with performing
a handshake on an AF_TLSH socket.

However, I don't see a need to outright avoid passing a connected
endpoint to user space. The only difficulty with it seems to be
that it hasn't been done before in quite this way.


--
Chuck Lever
Jakub Kicinski April 28, 2022, 9:08 p.m. UTC | #24
On Thu, 28 Apr 2022 01:29:10 +0000 Chuck Lever III wrote:
> > Is it possible to instead create a fd-passing-like structured message
> > which could carry the fd and all the relevant context (what goes 
> > via the getsockopt() now)?
> > 
> > The user space agent can open such upcall socket, then bind to
> > whatever entity it wants to talk to on the kernel side and read
> > the notifications via recv()?  
> 
> We considered this kind of design. A reasonable place to start there
> would be to fabricate new NETLINK messages to do this. I don't see
> much benefit over what is done now, it's just a different isomer of
> syntactic sugar, but it could be considered.
> 
> The issue is how the connected socket is materialized in user space.
> accept(2) is the historical way to instantiate an already connected
> socket in a process's file table, and seems like a natural fit. When
> the handshake agent is done with the handshake, it closes the socket.
> This invokes the tlsh_release() function which can check 

Actually - is that strictly necessary? It seems reasonable for NFS to
check that it got TLS, since that's what it explicitly asks for per
standard. But it may not always be the goal. In large data center
networks there can be a policy the user space agent consults to choose
what security to install. It may end up doing the auth but not enable
crypto if confidentiality is deemed unnecessary.

Obviously you may not have those requirements but if we can cover them
without extra complexity it'd be great.

> whether the IV implantation was successful.

I'm used to IV meaning Initialization Vector in the context of crypto,
what does "IV implantation" stand for?

> So instead of an AF_TLSH listener we could use a named pipe or a
> netlink socket and a blocking recv(), as long as there is a reasonable
> solution to how a connected socket fd is attached to the handshake
> agent process.
> 
> I'm flexible about the mechanism for passing handshake parameters.
> Attaching them to the connected socket seems convenient, but perhaps
> not aesthetic.

recv()-based version would certainly make me happy.
Jakub Kicinski April 28, 2022, 9:08 p.m. UTC | #25
On Thu, 28 Apr 2022 10:09:17 -0400 Benjamin Coddington wrote:
> > Noob reply: wish I knew.  (I somewhat hoped _you_ would've been able to
> > tell me.)
> >
> > Thing is, the only method I could think of for fd passing is the POSIX fd
> > passing via unix_attach_fds()/unix_detach_fds().  But that's AF_UNIX,
> > which really is designed for process-to-process communication, not
> > process-to-kernel.  So you probably have to move a similar logic over to
> > AF_NETLINK. And design a new interface on how fds should be passed over
> > AF_NETLINK.
> >
> > But then you have to face the issue that AF_NETLINK is essentially UDP, and
> > you have _no_ idea if and how many processes do listen on the other end.
> > Thing is, you (as the sender) have to copy the fd over to the receiving
> > process, so you'd better _hope_ there is a receiving process.  Not to
> > mention that there might be several processes listening in...

Sort of. I double checked the netlink upcall implementations we have,
they work by user space entity "registering" their netlink address
(portid) at startup. Kernel then directs the upcalls to that address.
But AFAICT there's currently no way for the netlink "server" to see
when a "client" goes away, which makes me slightly uneasy about using
such schemes for security related stuff. The user agent may crash and
something else could grab the same address, I think.

Let me CC OvS who uses it the most, perhaps I'm missing a trick.

My thinking was to use the netlink attribute format (just to reuse the
helpers and parsing, but we can invent a new TLV format if needed) but
create a new socket type specifically for upcalls.

> > And that's something I _definitely_ don't feel comfortable with without
> > guidance from the networking folks, so I didn't pursue it further and we
> > went with the 'accept()' mechanism Chuck implemented.
> >
> > I'm open to suggestions, though.  
> 
> EXPORT_SYMBOL(receive_fd) would allow interesting implementations.
> 
> The kernel keyring facilities have a good API for creating various key_types
> which are able to perform work such as this from userspace contexts.
> 
> I have a working prototype for a keyring key instantiation which allows a
> userspace process to install a kernel fd on its file table.  The problem
> here is how to match/route such fd passing to appropriate processes in
> appropriate namespaces.  I think this problem is shared by all
> kernel-to-userspace upcalls, which I hope we can discuss at LSF/MM.

Almost made me wish I was coming to LSF/MM :)

> I don't think kernel fds are very special as compared to userspace fds.
Chuck Lever April 28, 2022, 9:54 p.m. UTC | #26
> On Apr 28, 2022, at 5:08 PM, Jakub Kicinski <kuba@kernel.org> wrote:
> 
> On Thu, 28 Apr 2022 01:29:10 +0000 Chuck Lever III wrote:
>>> Is it possible to instead create a fd-passing-like structured message
>>> which could carry the fd and all the relevant context (what goes 
>>> via the getsockopt() now)?
>>> 
>>> The user space agent can open such upcall socket, then bind to
>>> whatever entity it wants to talk to on the kernel side and read
>>> the notifications via recv()?  
>> 
>> We considered this kind of design. A reasonable place to start there
>> would be to fabricate new NETLINK messages to do this. I don't see
>> much benefit over what is done now, it's just a different isomer of
>> syntactic sugar, but it could be considered.
>> 
>> The issue is how the connected socket is materialized in user space.
>> accept(2) is the historical way to instantiate an already connected
>> socket in a process's file table, and seems like a natural fit. When
>> the handshake agent is done with the handshake, it closes the socket.
>> This invokes the tlsh_release() function which can check 
> 
> Actually - is that strictly necessary? It seems reasonable for NFS to
> check that it got TLS, since that's what it explicitly asks for per
> standard. But it may not always be the goal. In large data center
> networks there can be a policy the user space agent consults to choose
> what security to install. It may end up doing the auth but not enable
> crypto if confidentiality is deemed unnecessary.
> 
> Obviously you may not have those requirements but if we can cover them
> without extra complexity it'd be great.

We can be flexible about how/where handshake success is checked.

However, using a simple close(2) to signal that the handshake
has completed does not communicate whether the handshake was
indeed successful. We will need a (richer) return/error code
from the handshake agent for that use case.


>> whether the IV implantation was successful.
> 
> I'm used to IV meaning Initialization Vector in context of crypto,
> what does "IV implementation" stand for?

Implantation, not implementation. The handshake agent implants
the initialization vector in the socket before it closes it.
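
For reference, here is a minimal user-space sketch of what "implanting"
could look like with the existing kTLS socket options. It follows the
TLS 1.2 AES-GCM-128 pattern from Documentation/networking/tls.rst. The
helper name tlsh_agent_install_tx() is hypothetical and not part of this
series; the key material is assumed to come from the TLS library once
the handshake has succeeded.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/tls.h>

#ifndef TCP_ULP
#define TCP_ULP 31	/* value from include/uapi/linux/tcp.h */
#endif
#ifndef SOL_TLS
#define SOL_TLS 282	/* value from include/linux/socket.h */
#endif

/*
 * Hypothetical helper: promote a connected TCP socket to the TLS ULP
 * and install the transmit key, salt, IV and record sequence number.
 * Returns 0 on success, -1 (with errno set) on failure.
 */
int tlsh_agent_install_tx(int fd,
			  const unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE],
			  const unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE],
			  const unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE],
			  const unsigned char seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE])
{
	struct tls12_crypto_info_aes_gcm_128 ci;

	/* Attach the "tls" upper layer protocol to the TCP socket. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
		return -1;

	memset(&ci, 0, sizeof(ci));
	ci.info.version = TLS_1_2_VERSION;
	ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
	memcpy(ci.key, key, sizeof(ci.key));
	memcpy(ci.salt, salt, sizeof(ci.salt));
	memcpy(ci.iv, iv, sizeof(ci.iv));
	memcpy(ci.rec_seq, seq, sizeof(ci.rec_seq));

	/* Hand the transmit-side session parameters to kTLS. */
	if (setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci)) < 0)
		return -1;
	return 0;
}
```

The receive direction would be set up the same way with TLS_RX. The
prototype's tlsh_crypto_info_initialized() checks both directions, so
an agent that installs only one of them is reported as a failed
handshake.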


--
Chuck Lever
Hannes Reinecke April 29, 2022, 6:25 a.m. UTC | #27
On 4/28/22 17:24, Chuck Lever III wrote:
> 
> 
>> On Apr 28, 2022, at 4:49 AM, Boris Pismenny <borispismenny@gmail.com> wrote:
>>
>> On 18/04/2022 19:49, Chuck Lever wrote:
>>> In-kernel TLS consumers need a way to perform a TLS handshake. In
>>> the absence of a handshake implementation in the kernel itself, a
>>> mechanism to perform the handshake in user space, using an existing
>>> TLS handshake library, is necessary.
>>>
>>> I've designed a way to pass a connected kernel socket endpoint to
>>> user space using the traditional listen/accept mechanism. accept(2)
>>> gives us a well-understood way to materialize a socket endpoint as a
>>> normal file descriptor in a specific user space process. Like any
>>> open socket descriptor, the accepted FD can then be passed to a
>>> library such as OpenSSL to perform a TLS handshake.
>>>
>>> This prototype currently handles only initiating client-side TLS
>>> handshakes. Server-side handshakes and key renegotiation are left
>>> to do.
>>>
>>> Security Considerations
>>> ~~~~~~~~ ~~~~~~~~~~~~~~
>>>
>>> This prototype is net-namespace aware.
>>>
>>> The kernel has no mechanism to attest that the listening user space
>>> agent is trustworthy.
>>>
>>> Currently the prototype does not handle multiple listeners that
>>> overlap -- multiple listeners in the same net namespace that have
>>> overlapping bind addresses.
>>>
>>
>> Thanks for posting this. As we discussed offline, I think this approach
>> is more manageable compared to a full in-kernel TLS handshake. A while
>> ago, I've hacked around TLS to implement the data-path for NVMe-TLS and
>> the data-path is indeed very simple provided an infrastructure such as
>> this one.
>>
>> Making this more generic is desirable, and this obviously requires
>> supporting multiple listeners for multiple protocols (TLS, DTLS, QUIC,
>> PSP, etc.), which suggests that it will reside somewhere outside of net/tls.
>> Moreover, there is a need to support (TLS) control messages here too.
>> These will occasionally require going back to the userspace daemon
>> during kernel packet processing. A few examples are handling: TLS rekey,
>> TLS close_notify, and TLS keepalives. I'm not saying that we need to
>> support everything from day-1, but there needs to be a way to support these.
> 
> I agree that control messages need to be handled as well. For the
> moment, the prototype simply breaks the connection when a control
> message is encountered, and a new session is negotiated. That of
> course is not the desired long-term solution.
> 
> If we believe that control messages are going to be distinct for
> each transport security layer, then perhaps we cannot make the
> handshake mechanism generic -- it will have to be specific to
> each security layer. Just a thought.
> 
> 
>> A related kernel interface is the XFRM netlink where the kernel asks a
>> userspace daemon to perform an IKE handshake for establishing IPsec SAs.
>> This works well when the handshake runs on a different socket, perhaps
>> that interface can be extended to do handshakes on a given socket that
>> lives in the kernel without actually passing the fd to userspace. If we
>> avoid instantiating a full socket fd in userspace, then the need for an
>> accept(2) interface is reduced, right?
> 
> Certainly piping the handshake messages up to user space instead
> of handing off a socket is possible. The TLS libraries would need
> to tolerate this, and GnuTLS (at least) appears OK with performing
> a handshake on an AF_TLSH socket.
> 
Yeah, and I guess that'll be the hard part.
We would need to design an entirely new data path for GnuTLS when going
down that path.
The beauty of the fd-passing idea is that GnuTLS (and OpenSSL for that
matter) will 'just work' (tm), without us having to do larger surgery there.
Just for reference, I've raised an issue with GnuTLS to accept long
identifiers in TLS 1.3 (issue #1323), which is required for
NVMe-over-TLS support. That one has been lingering for over two months now.
And that's a relatively simple change; I don't want to imagine how long
it'd take to try to push in a larger redesign...

Cheers,

Hannes
Chuck Lever April 29, 2022, 3:19 p.m. UTC | #28
> On Apr 28, 2022, at 9:12 AM, Simo Sorce <simo@redhat.com> wrote:
> 
> On Thu, 2022-04-28 at 11:49 +0300, Boris Pismenny wrote:
>> On 18/04/2022 19:49, Chuck Lever wrote:
>>> In-kernel TLS consumers need a way to perform a TLS handshake. In
>>> the absence of a handshake implementation in the kernel itself, a
>>> mechanism to perform the handshake in user space, using an existing
>>> TLS handshake library, is necessary.
>>> 
>>> I've designed a way to pass a connected kernel socket endpoint to
>>> user space using the traditional listen/accept mechanism. accept(2)
>>> gives us a well-understood way to materialize a socket endpoint as a
>>> normal file descriptor in a specific user space process. Like any
>>> open socket descriptor, the accepted FD can then be passed to a
>>> library such as OpenSSL to perform a TLS handshake.
>>> 
>>> This prototype currently handles only initiating client-side TLS
>>> handshakes. Server-side handshakes and key renegotiation are left
>>> to do.
>>> 
>>> Security Considerations
>>> ~~~~~~~~ ~~~~~~~~~~~~~~
>>> 
>>> This prototype is net-namespace aware.
>>> 
>>> The kernel has no mechanism to attest that the listening user space
>>> agent is trustworthy.
>>> 
>>> Currently the prototype does not handle multiple listeners that
>>> overlap -- multiple listeners in the same net namespace that have
>>> overlapping bind addresses.
>>> 
>> 
>> Thanks for posting this. As we discussed offline, I think this approach
>> is more manageable compared to a full in-kernel TLS handshake. A while
>> ago, I've hacked around TLS to implement the data-path for NVMe-TLS and
>> the data-path is indeed very simple provided an infrastructure such as
>> this one.
>> 
>> Making this more generic is desirable, and this obviously requires
>> supporting multiple listeners for multiple protocols (TLS, DTLS, QUIC,
>> PSP, etc.), which suggests that it will reside somewhere outside of net/tls.
>> Moreover, there is a need to support (TLS) control messages here too.
>> These will occasionally require going back to the userspace daemon
>> during kernel packet processing. A few examples are handling: TLS rekey,
>> TLS close_notify, and TLS keepalives. I'm not saying that we need to
>> support everything from day-1, but there needs to be a way to support these.
>> 
>> A related kernel interface is the XFRM netlink where the kernel asks a
>> userspace daemon to perform an IKE handshake for establishing IPsec SAs.
>> This works well when the handshake runs on a different socket, perhaps
>> that interface can be extended to do handshakes on a given socket that
>> lives in the kernel without actually passing the fd to userspace. If we
>> avoid instantiating a full socket fd in userspace, then the need for an
>> accept(2) interface is reduced, right?
> 
> JFYI:
> For in-kernel NFSD handshakes we also use the gssproxy unix socket in
> the kernel, which allows GSSAPI handshakes to be relayed from the
> kernel to a user space listening daemon.
> 
> The infrastructure is technically already available and could be
> reasonably simply extended to do TLS negotiations as well.

To fill in a little about our design thinking:

We chose not to use either gssproxy or gssd for the TLS handshake
prototype so that we don't add a dependency on RPC infrastructure
for other TLS consumers such as NVMe. Non-RPC consumers view that
kind of dependency as quite undesirable.

Also, neither of those existing mechanisms helped us address the
issue of passing a connected socket endpoint.
listen/poll/accept/close addresses that issue quite directly.
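
As a rough illustration of that sequence from the agent's side, here is
a minimal sketch. It assumes the listener is an AF_TLSH socket that has
already been bound and put into the listening state as described in this
series. serve_one_handshake() and the do_handshake callback are
hypothetical names used only for illustration.

```c
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical callback: runs the TLS library handshake on @fd. */
typedef int (*handshake_fn)(int fd);

/*
 * Wait for the kernel to queue one handshake request on @listener,
 * materialize the socket with accept(2), run the handshake, and
 * return the socket to the kernel by closing the descriptor.
 *
 * Returns the callback's result, or -1 on timeout or error.
 */
int serve_one_handshake(int listener, int timeout_ms,
			handshake_fn do_handshake)
{
	struct pollfd pfd = { .fd = listener, .events = POLLIN };
	int fd, ret;

	/* poll: block until a connected socket is ready to be accepted. */
	ret = poll(&pfd, 1, timeout_ms);
	if (ret <= 0)
		return -1;

	/* accept: install the kernel's socket in our file table. */
	fd = accept(listener, NULL, NULL);
	if (fd < 0)
		return -1;

	ret = do_handshake(fd);

	/* close: signals the kernel that the agent is finished. */
	close(fd);
	return ret;
}
```

Note that the callback's return value stays in user space. As pointed
out elsewhere in this thread, close(2) alone does not tell the kernel
whether the handshake succeeded.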


--
Chuck Lever
Ilya Maximets May 24, 2022, 10:05 a.m. UTC | #29
On 4/28/22 23:08, Jakub Kicinski wrote:
> On Thu, 28 Apr 2022 10:09:17 -0400 Benjamin Coddington wrote:
>>> Noob reply: wish I knew.  (I somewhat hoped _you_ would've been able to
>>> tell me.)
>>>
>>> Thing is, the only method I could think of for fd passing is the POSIX fd
>>> passing via unix_attach_fds()/unix_detach_fds().  But that's AF_UNIX,
>>> which really is designed for process-to-process communication, not
>>> process-to-kernel.  So you probably have to move a similar logic over to
>>> AF_NETLINK. And design a new interface on how fds should be passed over
>>> AF_NETLINK.
>>>
>>> But then you have to face the issue that AF_NETLINK is essentially UDP, and
>>> you have _no_ idea if and how many processes do listen on the other end.
>>> Thing is, you (as the sender) have to copy the fd over to the receiving
>>> process, so you'd better _hope_ there is a receiving process.  Not to
>>> mention that there might be several processes listening in...
> 
> Sort of. I double checked the netlink upcall implementations we have,
> they work by user space entity "registering" their netlink address
> (portid) at startup. Kernel then directs the upcalls to that address.
> But AFAICT there's currently no way for the netlink "server" to see
> when a "client" goes away, which makes me slightly uneasy about using
> such schemes for security related stuff. The user agent may crash and
> something else could grab the same address, I think.
> 
> Let me CC OvS who uses it the most, perhaps I'm missing a trick.

I don't think there are any tricks.  From what I see OVS creates
several netlink sockets, connects them to the kernel (nl_pid = 0)
and obtains their nl_pid's from the kernel.
These pids are either just a task_tgid_vnr() or a random negative
value from the [S32_MIN, -4096] range.  After that OVS "registers"
those pids in the openvswitch kernel module.  That just means sending
an array of integers to the kernel.  Kernel will later use these
integer pids to find the socket and send data to the userspace.

The openvswitch module inside the kernel has no way to detect that
a socket with a certain pid no longer exists.  So, it will continue
to try to find the socket and send, even if the user-space process
is dead.

So, if you can find a way to reliably create a process with the
same task_tgid or trick the randomizer inside netlink_autobind(),
you can start receiving upcalls from the kernel in a new process,
IIUC.  Also, netlink_bind() allows one to simply specify the nl_pid
for listening sockets.  That might be another way.
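
To make the port ID assignment concrete, here is a small user-space
sketch. nl_open_autobind() is a hypothetical helper, not OVS code. It
binds with nl_pid = 0 so that the kernel picks the port ID (connect()
on an unbound netlink socket triggers the same autobind), then reads
the result back with getsockname().

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>

/*
 * Hypothetical helper: open a netlink socket for @protocol, let the
 * kernel assign its port ID, and store that ID in @portid.
 *
 * Returns the socket descriptor, or -1 on error.
 */
int nl_open_autobind(int protocol, uint32_t *portid)
{
	struct sockaddr_nl addr;
	socklen_t len = sizeof(addr);
	int fd;

	fd = socket(AF_NETLINK, SOCK_RAW, protocol);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;
	addr.nl_pid = 0;	/* 0: ask the kernel to choose a port ID */

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
		close(fd);
		return -1;
	}

	*portid = addr.nl_pid;	/* this is what gets "registered" */
	return fd;
}
```

An OVS-style user would pass NETLINK_GENERIC as the protocol and send
the returned port IDs to the kernel module. Nothing in this exchange
tells the kernel-side consumer when the socket later goes away.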

> 
> My thinking was to use the netlink attribute format (just to reuse the
> helpers and parsing, but we can invent a new TLV format if needed) but
> create a new socket type specifically for upcalls.
> 
>>> And that's something I _definitely_ don't feel comfortable with without
>>> guidance from the networking folks, so I didn't pursue it further and we
>>> went with the 'accept()' mechanism Chuck implemented.
>>>
>>> I'm open to suggestions, though.  
>>
>> EXPORT_SYMBOL(receive_fd) would allow interesting implementations.
>>
>> The kernel keyring facilities have a good API for creating various key_types
>> which are able to perform work such as this from userspace contexts.
>>
>> I have a working prototype for a keyring key instantiation which allows a
>> userspace process to install a kernel fd on its file table.  The problem
>> here is how to match/route such fd passing to appropriate processes in
>> appropriate namespaces.  I think this problem is shared by all
>> kernel-to-userspace upcalls, which I hope we can discuss at LSF/MM.
> 
> Almost made me wish I was coming to LSF/MM :)
> 
>> I don't think kernel fds are very special as compared to userspace fds.
Patch

diff --git a/Documentation/networking/tls-in-kernel-handshake.rst b/Documentation/networking/tls-in-kernel-handshake.rst
new file mode 100644
index 000000000000..73ed6928f4b2
--- /dev/null
+++ b/Documentation/networking/tls-in-kernel-handshake.rst
@@ -0,0 +1,103 @@ 
+.. _kernel_tls:
+
+=======================
+In-Kernel TLS Handshake
+=======================
+
+Overview
+========
+
+Transport Layer Security (TLS) is an Upper Layer Protocol (ULP) that runs over
+TCP. TLS provides end-to-end data integrity and confidentiality.
+
+kTLS handles the TLS record subprotocol, but does not handle the TLS handshake
+subprotocol, used to establish a TLS session. In user space, a TLS library
+performs the handshake on a socket which is converted to kTLS operation. In
+the kernel it is much the same. The TLS handshake is done in user space by a
+library TLS implementation.
+
+
+User agent
+==========
+
+With the current implementation, a user agent is started in each network
+namespace where a kernel consumer might require a TLS handshake. This agent
+listens on an AF_TLSH socket for requests from the kernel to perform a
+handshake on an open and connected TCP socket.
+
+The open socket is passed to user space via accept(), which creates a file
+descriptor. If the handshake completes successfully, the user agent promotes
+the socket to use the TLS ULP and sets the session information using the
+SOL_TLS socket options. The user agent returns the socket to the kernel by
+closing the accepted file descriptor.
+
+
+Kernel Handshake API
+====================
+
+A kernel consumer initiates a client-side TLS handshake on an open
+socket by invoking one of the tls_client_hello() functions. For
+example:
+
+.. code-block:: c
+
+  ret = tls_client_hello_x509(sock, done_func, cookie, priorities,
+                              peerid, cert);
+
+The function returns zero when the handshake request is under way. A
+zero return guarantees the callback function @done_func will be invoked
+for this socket.
+
+The function returns a negative errno if the handshake could not be
+started. A negative errno guarantees the callback function @done_func
+will not be invoked on this socket.
+
+The @sock argument is an open and connected IPPROTO_TCP socket. The
+caller must hold a reference on the socket to prevent it from being
+destroyed while the handshake is in progress.
+
+@done_func is a callback function that is invoked, with @cookie as its
+first argument, when the handshake has completed (either successfully or
+not). The success status of the handshake is returned via the @status
+parameter of the callback function. A good practice is to close and
+destroy the socket immediately if the handshake has failed.
+
+@priorities is a GnuTLS priorities string that controls the handshake.
+The special value TLSH_DEFAULT_PRIORITIES causes the handshake to
+operate using user space configured default TLS priorities. However,
+the caller can use the string to (for example) adjust the handshake to
+use a restricted set of ciphers (say, if the kernel is in FIPS mode or
+the kernel consumer wants to mandate only a limited set of ciphers).
+
+@peerid is the serial number of a key on the XXXYYYZZZ keyring that
+contains a private key.
+
+@cert is the serial number of a key on the XXXYYYZZZ keyring that
+contains a {PEM,DER} format x.509 certificate that the user agent
+presents to the server as the local peer's identity.
+
+To initiate a client-side TLS handshake with a pre-shared key, use:
+
+.. code-block:: c
+
+  ret = tls_client_hello_psk(sock, done_func, cookie, priorities,
+                             peerid);
+
+@peerid is the serial number of a key on the XXXYYYZZZ keyring that
+contains the pre-shared key.
+
+The other parameters are as above.
+
+
+Other considerations
+--------------------
+
+While the handshake is under way, the kernel consumer must alter the
+socket's sk_data_ready callback function to ignore incoming data.
+Once the callback function has been invoked, normal receive operation
+can be resumed.
+
+See tls.rst for details on how a kTLS consumer recognizes incoming
+(decrypted) application data, alerts, and handshake packets once the
+socket has been promoted to use the TLS ULP.
+
diff --git a/include/linux/socket.h b/include/linux/socket.h
index fc28c68e6b5f..69acb5668d34 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -369,6 +369,7 @@  struct ucred {
 #define SOL_MPTCP	284
 #define SOL_MCTP	285
 #define SOL_SMC		286
+#define SOL_TLSH	287
 
 /* IPX options */
 #define IPX_TYPE	1
diff --git a/include/net/sock.h b/include/net/sock.h
index d2a513169527..d5a5d5fd6682 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -353,6 +353,7 @@  struct sk_filter;
   *	@sk_txtime_report_errors: set report errors mode for SO_TXTIME
   *	@sk_txtime_unused: unused txtime flags
   *	@ns_tracker: tracker for netns reference
+  *	@sk_tlsh_priv: private data for TLS handshake upcall
   */
 struct sock {
 	/*
@@ -544,6 +545,8 @@  struct sock {
 #endif
 	struct rcu_head		sk_rcu;
 	netns_tracker		ns_tracker;
+
+	void			*sk_tlsh_priv;
 };
 
 enum sk_pacing {
diff --git a/include/net/tls.h b/include/net/tls.h
index b6968a5b5538..6b1bf46daa34 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -51,6 +51,18 @@ 
 #include <uapi/linux/tls.h>
 
 
+struct tlsh_sock {
+	/* struct sock must remain the first field */
+	struct sock	th_sk;
+
+	int		th_bind_family;
+};
+
+static inline struct tlsh_sock *tlsh_sk(struct sock *sk)
+{
+	return (struct tlsh_sock *)sk;
+}
+
 /* Maximum data size carried in a TLS record */
 #define TLS_MAX_PAYLOAD_SIZE		((size_t)1 << 14)
 
@@ -356,6 +368,9 @@  struct tls_context *tls_ctx_create(struct sock *sk);
 void tls_ctx_free(struct sock *sk, struct tls_context *ctx);
 void update_sk_prot(struct sock *sk, struct tls_context *ctx);
 
+int tlsh_pf_create(struct net *net, struct socket *sock, int protocol,
+		   int kern);
+
 int wait_on_pending_writer(struct sock *sk, long *timeo);
 int tls_sk_query(struct sock *sk, int optname, char __user *optval,
 		int __user *optlen);
diff --git a/include/net/tlsh.h b/include/net/tlsh.h
new file mode 100644
index 000000000000..8725fd83df60
--- /dev/null
+++ b/include/net/tlsh.h
@@ -0,0 +1,22 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * PF_TLSH protocol family socket handler.
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2021, Oracle and/or its affiliates.
+ */
+
+#ifndef _TLS_HANDSHAKE_H
+#define _TLS_HANDSHAKE_H
+
+extern int tls_client_hello_psk(struct socket *sock,
+				void (*done)(void *data, int status),
+				void *data, const char *priorities,
+				key_serial_t peerid);
+extern int tls_client_hello_x509(struct socket *sock,
+				 void (*done)(void *data, int status),
+				 void *data, const char *priorities,
+				 key_serial_t peerid, key_serial_t cert);
+
+#endif /* _TLS_HANDSHAKE_H */
diff --git a/include/uapi/linux/tls.h b/include/uapi/linux/tls.h
index 5f38be0ec0f3..d0ffbb6ea0e4 100644
--- a/include/uapi/linux/tls.h
+++ b/include/uapi/linux/tls.h
@@ -40,6 +40,22 @@ 
 #define TLS_TX			1	/* Set transmit parameters */
 #define TLS_RX			2	/* Set receive parameters */
 
+/* TLSH socket options */
+#define TLSH_PRIORITIES		1	/* Retrieve TLS priorities string */
+#define TLSH_PEERID		2	/* Retrieve peer identity */
+#define TLSH_HANDSHAKE_TYPE	3	/* Retrieve handshake type */
+#define TLSH_X509_CERTIFICATE	4	/* Retrieve x.509 certificate */
+
+#define TLSH_DEFAULT_PRIORITIES		(NULL)
+#define TLSH_NO_PEERID			(0)
+#define TLSH_NO_CERT			(0)
+
+/* TLSH handshake types */
+enum tlsh_hs_type {
+	TLSH_TYPE_CLIENTHELLO_X509,
+	TLSH_TYPE_CLIENTHELLO_PSK,
+};
+
 /* Supported versions */
 #define TLS_VERSION_MINOR(ver)	((ver) & 0xFF)
 #define TLS_VERSION_MAJOR(ver)	(((ver) >> 8) & 0xFF)
diff --git a/net/core/sock.c b/net/core/sock.c
index 81bc14b67468..d9f700e5ea1a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3295,6 +3295,8 @@  void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_incoming_cpu = -1;
 	sk->sk_txrehash = SOCK_TXREHASH_DEFAULT;
 
+	sk->sk_tlsh_priv = NULL;
+
 	sk_rx_queue_clear(sk);
 	/*
 	 * Before updating sk_refcnt, we must commit prior changes to memory
diff --git a/net/tls/Makefile b/net/tls/Makefile
index f1ffbfe8968d..d159a03b94f3 100644
--- a/net/tls/Makefile
+++ b/net/tls/Makefile
@@ -7,7 +7,7 @@  CFLAGS_trace.o := -I$(src)
 
 obj-$(CONFIG_TLS) += tls.o
 
-tls-y := tls_main.o tls_sw.o tls_proc.o trace.o
+tls-y := af_tlsh.o tls_main.o tls_sw.o tls_proc.o trace.o
 
 tls-$(CONFIG_TLS_TOE) += tls_toe.o
 tls-$(CONFIG_TLS_DEVICE) += tls_device.o tls_device_fallback.o
diff --git a/net/tls/af_tlsh.c b/net/tls/af_tlsh.c
new file mode 100644
index 000000000000..4d1c1de3a474
--- /dev/null
+++ b/net/tls/af_tlsh.c
@@ -0,0 +1,1040 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * PF_TLSH protocol family socket handler.
+ *
+ * Author: Chuck Lever <chuck.lever@oracle.com>
+ *
+ * Copyright (c) 2021, Oracle and/or its affiliates.
+ *
+ * When a kernel TLS consumer wants to establish a TLS session, it
+ * makes an AF_TLSH Listener ready. When user space accepts on that
+ * listener, the kernel fabricates a user space socket endpoint on
+ * which a user space TLS library can perform the TLS handshake.
+ *
+ * Closing the user space descriptor signals to the kernel that the
+ * library handshake process is complete. If the library has managed
+ * to initialize the socket's TLS crypto_info, the kernel marks the
+ * handshake as a success.
+ */
+
+/*
+ * Socket reference counting
+ *  A: listener socket initial reference
+ *  B: listener socket on the global listener list
+ *  C: listener socket while a ready AF_INET(6) socket is enqueued
+ *  D: listener socket while its accept queue is drained
+ *
+ *  I: ready AF_INET(6) socket waiting on a listener's accept queue
+ *  J: ready AF_INET(6) socket with a consumer waiting for a completion callback
+ */
+
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/in.h>
+#include <linux/kernel.h>
+#include <linux/poll.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/inet.h>
+
+#include <net/ip.h>
+#include <net/ipv6.h>
+#include <net/tcp.h>
+#include <net/protocol.h>
+#include <net/sock.h>
+#include <net/inet_common.h>
+#include <net/net_namespace.h>
+#include <net/tls.h>
+#include <net/tlsh.h>
+
+#include "trace.h"
+
+
+struct tlsh_sock_info {
+	enum tlsh_hs_type	tsi_handshake_type;
+
+	void			(*tsi_handshake_done)(void *data, int status);
+	void			*tsi_handshake_data;
+	char			*tsi_tls_priorities;
+	key_serial_t		tsi_peerid;
+	key_serial_t		tsi_certificate;
+
+	struct socket_wq	*tsi_saved_wq;
+	struct socket		*tsi_saved_socket;
+	kuid_t			tsi_saved_uid;
+};
+
+static void tlsh_sock_info_destroy(struct tlsh_sock_info *info)
+{
+	kfree(info->tsi_tls_priorities);
+	kfree(info);
+}
+
+static DEFINE_RWLOCK(tlsh_listener_lock);
+static HLIST_HEAD(tlsh_listeners);
+
+static void tlsh_register_listener(struct sock *sk)
+{
+	write_lock_bh(&tlsh_listener_lock);
+	sk_add_node(sk, &tlsh_listeners);	/* Ref: B */
+	write_unlock_bh(&tlsh_listener_lock);
+}
+
+static void tlsh_unregister_listener(struct sock *sk)
+{
+	write_lock_bh(&tlsh_listener_lock);
+	sk_del_node_init(sk);			/* Ref: B */
+	write_unlock_bh(&tlsh_listener_lock);
+}
+
+/**
+ * tlsh_find_listener - find listener that matches an incoming connection
+ * @net: net namespace to match
+ * @family: address family to match
+ *
+ * Return values:
+ *   On success, address of a listening AF_TLSH socket
+ *   %NULL: No matching listener found
+ */
+static struct sock *tlsh_find_listener(struct net *net, unsigned short family)
+{
+	struct sock *listener;
+
+	read_lock(&tlsh_listener_lock);
+
+	sk_for_each(listener, &tlsh_listeners) {
+		if (sock_net(listener) != net)
+			continue;
+		if (tlsh_sk(listener)->th_bind_family != AF_UNSPEC &&
+		    tlsh_sk(listener)->th_bind_family != family)
+			continue;
+
+		sock_hold(listener);	/* Ref: C */
+		goto out;
+	}
+	listener = NULL;
+
+out:
+	read_unlock(&tlsh_listener_lock);
+	return listener;
+}
+
+/**
+ * tlsh_accept_enqueue - add a socket to a listener's accept_q
+ * @listener: listening socket
+ * @sk: socket to enqueue on @listener
+ *
+ * Return values:
+ *   On success, returns 0
+ *   %-ENOMEM: Memory for skbs has been exhausted
+ */
+static int tlsh_accept_enqueue(struct sock *listener, struct sock *sk)
+{
+	struct sk_buff *skb;
+
+	skb = alloc_skb(0, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	sock_hold(sk);	/* Ref: I */
+	skb->sk = sk;
+	skb_queue_tail(&listener->sk_receive_queue, skb);
+	sk_acceptq_added(listener);
+	listener->sk_data_ready(listener);
+	return 0;
+}
+
+/**
+ * tlsh_accept_dequeue - remove a socket from a listener's accept_q
+ * @listener: listener socket to check
+ *
+ * Caller guarantees that @listener won't disappear.
+ *
+ * Return values:
+ *   On success, return a TCP socket waiting for TLS service
+ *   %NULL: No sockets on the accept queue
+ */
+static struct sock *tlsh_accept_dequeue(struct sock *listener)
+{
+	struct sk_buff *skb;
+	struct sock *sk;
+
+	skb = skb_dequeue(&listener->sk_receive_queue);
+	if (!skb)
+		return NULL;
+	sk_acceptq_removed(listener);
+	sock_put(listener);	/* Ref: C */
+
+	sk = skb->sk;
+	skb->sk = NULL;
+	kfree_skb(skb);
+	sock_put(sk);	/* Ref: I */
+	return sk;
+}
+
+static void tlsh_sock_save(struct sock *sk,
+			   struct tlsh_sock_info *info)
+{
+	sock_hold(sk);	/* Ref: J */
+
+	write_lock_bh(&sk->sk_callback_lock);
+	info->tsi_saved_wq = sk->sk_wq_raw;
+	info->tsi_saved_socket = sk->sk_socket;
+	info->tsi_saved_uid = sk->sk_uid;
+	sk->sk_tlsh_priv = info;
+	write_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void tlsh_sock_clear(struct sock *sk)
+{
+	struct tlsh_sock_info *info = sk->sk_tlsh_priv;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	sk->sk_tlsh_priv = NULL;
+	write_unlock_bh(&sk->sk_callback_lock);
+	tlsh_sock_info_destroy(info);
+	sock_put(sk);	/* Ref: J (err) */
+}
+
+static void tlsh_sock_restore_locked(struct sock *sk)
+{
+	struct tlsh_sock_info *info = sk->sk_tlsh_priv;
+
+	sk->sk_wq_raw = info->tsi_saved_wq;
+	sk->sk_socket = info->tsi_saved_socket;
+	sk->sk_uid = info->tsi_saved_uid;
+	sk->sk_tlsh_priv = NULL;
+}
+
+static bool tlsh_crypto_info_initialized(struct sock *sk)
+{
+	struct tls_context *ctx = tls_get_ctx(sk);
+
+	return ctx &&
+		TLS_CRYPTO_INFO_READY(&ctx->crypto_send.info) &&
+		TLS_CRYPTO_INFO_READY(&ctx->crypto_recv.info);
+}
+
+/**
+ * tlsh_handshake_done - call the registered "done" callback for @sk.
+ * @sk: socket that was requesting a handshake
+ *
+ * Return values:
+ *   %true:  Handshake callback was called
+ *   %false: No handshake callback was set, no-op
+ */
+static bool tlsh_handshake_done(struct sock *sk)
+{
+	struct tlsh_sock_info *info;
+	void (*done)(void *data, int status);
+	void *data;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	info = sk->sk_tlsh_priv;
+	if (info) {
+		done = info->tsi_handshake_done;
+		data = info->tsi_handshake_data;
+
+		tlsh_sock_restore_locked(sk);
+
+		if (tlsh_crypto_info_initialized(sk))
+			done(data, 0);
+		else
+			done(data, -EACCES);
+	}
+	write_unlock_bh(&sk->sk_callback_lock);
+
+	if (info) {
+		tlsh_sock_info_destroy(info);
+		sock_put(sk);	/* Ref: J */
+		return true;
+	}
+	return false;
+}
+
+/**
+ * tlsh_accept_drain - clean up children queued for accept
+ * @listener: listener socket to drain
+ *
+ */
+static void tlsh_accept_drain(struct sock *listener)
+{
+	struct sock *sk;
+
+	while ((sk = tlsh_accept_dequeue(listener)))
+		tlsh_handshake_done(sk);
+}
+
+/**
+ * tlsh_release - free an AF_TLSH socket
+ * @sock: socket to release
+ *
+ * Return values:
+ *   %0: success
+ */
+static int tlsh_release(struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct tlsh_sock *tsk = tlsh_sk(sk);
+
+	if (!sk)
+		return 0;
+
+	switch (sk->sk_family) {
+	case AF_INET:
+		if (!tlsh_handshake_done(sk))
+			return inet_release(sock);
+		return 0;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		if (!tlsh_handshake_done(sk))
+			return inet6_release(sock);
+		return 0;
+#endif
+	case AF_TLSH:
+		break;
+	default:
+		return 0;
+	}
+
+	sock_hold(sk);	/* Ref: D */
+	sock_orphan(sk);
+	lock_sock(sk);
+
+	tlsh_unregister_listener(sk);
+	tlsh_accept_drain(sk);
+
+	sk->sk_state = TCP_CLOSE;
+	sk->sk_shutdown |= SEND_SHUTDOWN;
+	sk->sk_state_change(sk);
+
+	tsk->th_bind_family = AF_UNSPEC;
+	sock->sk = NULL;
+	release_sock(sk);
+	sock_put(sk);	/* Ref: D */
+
+	sock_put(sk);	/* Ref: A */
+	return 0;
+}
+
+/**
+ * tlsh_bind - bind a name to an AF_TLSH socket
+ * @sock: socket to be bound
+ * @uaddr: address to bind to
+ * @addrlen: length in bytes of @uaddr
+ *
+ * Binding an AF_TLSH socket selects the address family of the
+ * connections it can accept(2): AF_INET for IPv4 or AF_INET6 for IPv6.
+ *
+ * Return values:
+ *   %0: binding was successful.
+ *   %-EPERM: Caller not privileged
+ *   %-EINVAL: @addrlen does not match the family of @uaddr
+ *   %-EAFNOSUPPORT: Family of @uaddr not supported
+ *   %-EADDRINUSE: A listener for this family already exists
+ */
+static int tlsh_bind(struct socket *sock, struct sockaddr *uaddr, int addrlen)
+{
+	struct sock *listener, *sk = sock->sk;
+	struct tlsh_sock *tsk = tlsh_sk(sk);
+
+	if (!capable(CAP_NET_BIND_SERVICE))
+		return -EPERM;
+
+	switch (uaddr->sa_family) {
+	case AF_INET:
+		if (addrlen != sizeof(struct sockaddr_in))
+			return -EINVAL;
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		if (addrlen != sizeof(struct sockaddr_in6))
+			return -EINVAL;
+		break;
+#endif
+	default:
+		return -EAFNOSUPPORT;
+	}
+
+	listener = tlsh_find_listener(sock_net(sk), uaddr->sa_family);
+	if (listener) {
+		sock_put(listener);	/* Ref: C */
+		return -EADDRINUSE;
+	}
+
+	tsk->th_bind_family = uaddr->sa_family;
+	return 0;
+}
+
+/**
+ * tlsh_accept - return a connection waiting for a TLS handshake
+ * @listener: listener socket which connection requests arrive on
+ * @newsock: socket to move incoming connection to
+ * @flags: SOCK_NONBLOCK and/or SOCK_CLOEXEC
+ * @kern: "boolean": 1 for kernel-internal sockets
+ *
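+ * Return values:
+ *   %0: @newsock has been initialized.
+ *   %-EPERM: caller is not privileged
+ *   %-EBADF: @listener is not in the listening state
+ *   %-EAGAIN: no connection is queued and the socket is non-blocking,
+ *	       or the wait timed out
+ *   %-EINTR or %-ERESTARTSYS: the wait was interrupted by a signal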
+ */
+static int tlsh_accept(struct socket *listener, struct socket *newsock,
+		       int flags, bool kern)
+{
+	struct sock *sk = listener->sk, *newsk;
+	DECLARE_WAITQUEUE(wait, current);
+	long timeo;
+	int rc;
+
+	rc = -EPERM;
+	if (!capable(CAP_NET_BIND_SERVICE))
+		goto out;
+
+	lock_sock(sk);
+
+	if (sk->sk_state != TCP_LISTEN) {
+		rc = -EBADF;
+		goto out_release;
+	}
+
+	timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK);
+
+	rc = 0;
+	add_wait_queue_exclusive(sk_sleep(sk), &wait);
+	while (!(newsk = tlsh_accept_dequeue(sk))) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (!timeo) {
+			rc = -EAGAIN;
+			break;
+		}
+		release_sock(sk);
+
+		timeo = schedule_timeout(timeo);
+
+		lock_sock(sk);
+		if (sk->sk_state != TCP_LISTEN) {
+			rc = -EBADF;
+			break;
+		}
+		if (signal_pending(current)) {
+			rc = sock_intr_errno(timeo);
+			break;
+		}
+	}
+	set_current_state(TASK_RUNNING);
+	remove_wait_queue(sk_sleep(sk), &wait);
+	if (rc) {
+		tlsh_handshake_done(sk);
+		goto out_release;
+	}
+
+	sock_graft(newsk, newsock);
+
+out_release:
+	release_sock(sk);
+out:
+	return rc;
+}
+
+/**
+ * tlsh_getname - retrieve src/dst address information from an AF_TLSH socket
+ * @sock: socket to query
+ * @uaddr: buffer to fill in
+ * @peer: value indicates which address to retrieve
+ *
+ * Return values:
+ *   On success, a positive length of the address in @uaddr
+ *   On error, a negative errno
+ */
+static int tlsh_getname(struct socket *sock, struct sockaddr *uaddr, int peer)
+{
+	struct sock *sk = sock->sk;
+
+	switch (sk->sk_family) {
+	case AF_INET:
+		return inet_getname(sock, uaddr, peer);
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		return inet6_getname(sock, uaddr, peer);
+#endif
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+/**
+ * tlsh_poll - check for data ready on an AF_TLSH socket
+ * @file: file to check for work
+ * @sock: socket associated with @file
+ * @wait: poll table
+ *
+ * Return values:
+ *    A mask of flags indicating what type of I/O is ready
+ */
+static __poll_t tlsh_poll(struct file *file, struct socket *sock,
+			  poll_table *wait)
+{
+	struct sock *sk = sock->sk;
+	__poll_t mask;
+
+	sock_poll_wait(file, sock, wait);
+
+	mask = 0;
+
+	if (sk->sk_state == TCP_LISTEN) {
+		if (!skb_queue_empty_lockless(&sk->sk_receive_queue))
+			mask |= EPOLLIN | EPOLLRDNORM;
+		if (sk_is_readable(sk))
+			mask |= EPOLLIN | EPOLLRDNORM;
+		return mask;
+	}
+
+	if (sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE)
+		mask |= EPOLLHUP;
+	if (sk->sk_shutdown & RCV_SHUTDOWN)
+		mask |= EPOLLIN | EPOLLRDNORM | EPOLLRDHUP;
+
+	if (!skb_queue_empty_lockless(&sk->sk_receive_queue))
+		mask |= EPOLLIN | EPOLLRDNORM;
+	if (sk_is_readable(sk))
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	/* This barrier is coupled with smp_wmb() in tcp_reset() */
+	smp_rmb();
+	if (sk->sk_err || !skb_queue_empty_lockless(&sk->sk_error_queue))
+		mask |= EPOLLERR;
+
+	return mask;
+}
+
+/**
+ * tlsh_listen - move an AF_TLSH socket into a listening state
+ * @sock: socket to transition to listening state
+ * @backlog: size of backlog queue
+ *
+ * Return values:
+ *   %0: @sock is now in a listening state
+ *   %-EPERM: caller is not privileged
+ *   %-EOPNOTSUPP: @sock is not of a type that supports the listen() operation
+ */
+static int tlsh_listen(struct socket *sock, int backlog)
+{
+	struct sock *sk = sock->sk;
+	unsigned char old_state;
+	int rc;
+
+	if (!capable(CAP_NET_BIND_SERVICE))
+		return -EPERM;
+
+	lock_sock(sk);
+
+	rc = -EOPNOTSUPP;
+	if (sock->state != SS_UNCONNECTED || sock->type != SOCK_STREAM)
+		goto out;
+	old_state = sk->sk_state;
+	if (!((1 << old_state) & (TCPF_CLOSE | TCPF_LISTEN)))
+		goto out;
+
+	sk->sk_max_ack_backlog = backlog;
+	sk->sk_state = TCP_LISTEN;
+	tlsh_register_listener(sk);
+
+	rc = 0;
+
+out:
+	release_sock(sk);
+	return rc;
+}
+
+/**
+ * tlsh_shutdown - Shutdown an AF_TLSH socket
+ * @sock: socket to shut down
+ * @how: SHUT_RD, SHUT_WR, or SHUT_RDWR
+ *
+ * Return values:
+ *   %0: Success
+ *   %-EINVAL: @sock is not of a type that supports a shutdown
+ */
+static int tlsh_shutdown(struct socket *sock, int how)
+{
+	struct sock *sk = sock->sk;
+
+	switch (sk->sk_family) {
+	case AF_INET:
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		break;
+#endif
+	default:
+		return -EINVAL;
+	}
+
+	return inet_shutdown(sock, how);
+}
+
+/**
+ * tlsh_setsockopt - Set a socket option on an AF_TLSH socket
+ * @sock: socket to act upon
+ * @level: which network layer to act upon
+ * @optname: which option to set
+ * @optval: new value to set
+ * @optlen: the size of the new value, in bytes
+ *
+ * Return values:
+ *   %0: Success
+ *   %-ENOPROTOOPT: The option is unknown at the level indicated.
+ */
+static int tlsh_setsockopt(struct socket *sock, int level, int optname,
+			   sockptr_t optval, unsigned int optlen)
+{
+	struct sock *sk = sock->sk;
+
+	switch (sk->sk_family) {
+	case AF_INET:
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		break;
+#endif
+	default:
+		return -ENOPROTOOPT;
+	}
+
+	return sock_common_setsockopt(sock, level, optname, optval, optlen);
+}
+
+static int tlsh_getsockopt_priorities(struct sock *sk, char __user *optval,
+				      int __user *optlen)
+{
+	struct tlsh_sock_info *info;
+	int outlen, len, ret;
+	const char *val;
+
+	if (get_user(len, optlen))
+		return -EFAULT;
+	if (!optval)
+		return -EINVAL;
+
+	ret = 0;
+
+	sock_hold(sk);
+	write_lock_bh(&sk->sk_callback_lock);
+
+	info = sk->sk_tlsh_priv;
+	if (info) {
+		val = info->tsi_tls_priorities;
+	} else {
+		write_unlock_bh(&sk->sk_callback_lock);
+		ret = -EBUSY;
+		goto out_put;
+	}
+
+	write_unlock_bh(&sk->sk_callback_lock);
+
+	if (val) {
+		outlen = strlen(val);
+		if (len < outlen)
+			ret = -EINVAL;
+		else if (copy_to_user(optval, val, outlen))
+			ret = -EFAULT;
+	} else {
+		outlen = 0;
+	}
+
+	if (put_user(outlen, optlen))
+		ret = -EFAULT;
+
+out_put:
+	sock_put(sk);
+	return ret;
+}
+
+static int tlsh_getsockopt_peerid(struct sock *sk, char __user *optval,
+				  int __user *optlen)
+{
+	struct tlsh_sock_info *info;
+	int len, val;
+
+	if (get_user(len, optlen))
+		return -EFAULT;
+	if (!optval || len < (int)sizeof(key_serial_t))
+		return -EINVAL;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	info = sk->sk_tlsh_priv;
+	if (info) {
+		val = info->tsi_peerid;
+	} else {
+		write_unlock_bh(&sk->sk_callback_lock);
+		return -EBUSY;
+	}
+	write_unlock_bh(&sk->sk_callback_lock);
+
+	if (put_user((int)sizeof(val), optlen))
+		return -EFAULT;
+	if (copy_to_user(optval, &val, sizeof(val)))
+		return -EFAULT;
+	return 0;
+}
+
+static int tlsh_getsockopt_type(struct sock *sk, char __user *optval,
+				int __user *optlen)
+{
+	struct tlsh_sock_info *info;
+	int len, val;
+
+	if (get_user(len, optlen))
+		return -EFAULT;
+	if (!optval || len < (int)sizeof(val))
+		return -EINVAL;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	info = sk->sk_tlsh_priv;
+	if (info) {
+		val = info->tsi_handshake_type;
+	} else {
+		write_unlock_bh(&sk->sk_callback_lock);
+		return -EBUSY;
+	}
+	write_unlock_bh(&sk->sk_callback_lock);
+
+	if (put_user((int)sizeof(val), optlen))
+		return -EFAULT;
+	if (copy_to_user(optval, &val, sizeof(val)))
+		return -EFAULT;
+	return 0;
+}
+
+static int tlsh_getsockopt_cert(struct sock *sk, char __user *optval,
+				int __user *optlen)
+{
+	struct tlsh_sock_info *info;
+	int len, val;
+
+	if (get_user(len, optlen))
+		return -EFAULT;
+	if (!optval || len < (int)sizeof(key_serial_t))
+		return -EINVAL;
+
+	write_lock_bh(&sk->sk_callback_lock);
+	info = sk->sk_tlsh_priv;
+	if (info) {
+		val = info->tsi_certificate;
+	} else {
+		write_unlock_bh(&sk->sk_callback_lock);
+		return -EBUSY;
+	}
+	write_unlock_bh(&sk->sk_callback_lock);
+
+	if (put_user((int)sizeof(val), optlen))
+		return -EFAULT;
+	if (copy_to_user(optval, &val, sizeof(val)))
+		return -EFAULT;
+	return 0;
+}
+
+/**
+ * tlsh_getsockopt - Retrieve a socket option from an AF_TLSH socket
+ * @sock: socket to act upon
+ * @level: which network layer to act upon
+ * @optname: which option to retrieve
+ * @optval: a buffer into which to receive the option's value
+ * @optlen: the size of the receive buffer, in bytes
+ *
+ * Return values:
+ *   %0: Success
+ *   %-ENOPROTOOPT: The option is unknown at the level indicated.
+ *   %-EINVAL: Invalid argument
+ *   %-EFAULT: Output memory not write-able
+ *   %-EBUSY: Option value not available
+ */
+static int tlsh_getsockopt(struct socket *sock, int level, int optname,
+			   char __user *optval, int __user *optlen)
+{
+	struct sock *sk = sock->sk;
+	int ret;
+
+	switch (sk->sk_family) {
+	case AF_INET:
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		break;
+#endif
+	default:
+		return -ENOPROTOOPT;
+	}
+
+	if (level != SOL_TLSH)
+		return sock_common_getsockopt(sock, level, optname, optval, optlen);
+
+	switch (optname) {
+	case TLSH_PRIORITIES:
+		ret = tlsh_getsockopt_priorities(sk, optval, optlen);
+		break;
+	case TLSH_PEERID:
+		ret = tlsh_getsockopt_peerid(sk, optval, optlen);
+		break;
+	case TLSH_HANDSHAKE_TYPE:
+		ret = tlsh_getsockopt_type(sk, optval, optlen);
+		break;
+	case TLSH_X509_CERTIFICATE:
+		ret = tlsh_getsockopt_cert(sk, optval, optlen);
+		break;
+	default:
+		ret = -ENOPROTOOPT;
+	}
+
+	return ret;
+}
+
+/**
+ * tlsh_sendmsg - Send a message on an AF_TLSH socket
+ * @sock: socket to send on
+ * @msg: message to send
+ * @size: size of message, in bytes
+ *
+ * Return values:
+ *   On success, the number of bytes sent
+ *   %-EOPNOTSUPP: Address family does not support this operation
+ */
+static int tlsh_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
+{
+	struct sock *sk = sock->sk;
+
+	switch (sk->sk_family) {
+	case AF_INET:
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		break;
+#endif
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	if (unlikely(inet_send_prepare(sk)))
+		return -EAGAIN;
+	return sk->sk_prot->sendmsg(sk, msg, size);
+}
+
+/**
+ * tlsh_recvmsg - Receive a message from an AF_TLSH socket
+ * @sock: socket to receive from
+ * @msg: buffer into which to receive
+ * @size: size of buffer, in bytes
+ * @flags: control settings
+ *
+ * Return values:
+ *   On success, the number of bytes received
+ *   %-EOPNOTSUPP: Address family does not support this operation
+ */
+static int tlsh_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
+			int flags)
+{
+	struct sock *sk = sock->sk;
+
+	switch (sk->sk_family) {
+	case AF_INET:
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		break;
+#endif
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	if (likely(!(flags & MSG_ERRQUEUE)))
+		sock_rps_record_flow(sk);
+	return sock_common_recvmsg(sock, msg, size, flags);
+}
+
+static const struct proto_ops tlsh_proto_ops = {
+	.family		= PF_TLSH,
+	.owner		= THIS_MODULE,
+
+	.release	= tlsh_release,
+	.bind		= tlsh_bind,
+	.connect	= sock_no_connect,
+	.socketpair	= sock_no_socketpair,
+	.accept		= tlsh_accept,
+	.getname	= tlsh_getname,
+	.poll		= tlsh_poll,
+	.ioctl		= sock_no_ioctl,
+	.gettstamp	= sock_gettstamp,
+	.listen		= tlsh_listen,
+	.shutdown	= tlsh_shutdown,
+	.setsockopt	= tlsh_setsockopt,
+	.getsockopt	= tlsh_getsockopt,
+	.sendmsg	= tlsh_sendmsg,
+	.recvmsg	= tlsh_recvmsg,
+	.mmap		= sock_no_mmap,
+	.sendpage	= sock_no_sendpage,
+};
+
+static struct proto tlsh_prot = {
+	.name			= "TLSH",
+	.owner			= THIS_MODULE,
+	.obj_size		= sizeof(struct tlsh_sock),
+};
+
+/**
+ * tlsh_pf_create - create an AF_TLSH socket
+ * @net: network namespace to own the new socket
+ * @sock: socket to initialize
+ * @protocol: IP protocol number; must be IPPROTO_TCP
+ * @kern: "boolean": 1 for kernel-internal sockets
+ *
+ * Return values:
+ *   %0: @sock was initialized, and module ref count incremented.
+ *   Negative errno values indicate initialization failed.
+ */
+int tlsh_pf_create(struct net *net, struct socket *sock, int protocol, int kern)
+{
+	struct sock *sk;
+	int rc;
+
+	if (protocol != IPPROTO_TCP)
+		return -EPROTONOSUPPORT;
+
+	/* only stream sockets are supported */
+	if (sock->type != SOCK_STREAM)
+		return -ESOCKTNOSUPPORT;
+
+	sock->state = SS_UNCONNECTED;
+	sock->ops = &tlsh_proto_ops;
+
+	/* Ref: A */
+	sk = sk_alloc(net, PF_TLSH, GFP_KERNEL, &tlsh_prot, kern);
+	if (!sk)
+		return -ENOMEM;
+
+	sock_init_data(sock, sk);
+	if (sk->sk_prot->init) {
+		rc = sk->sk_prot->init(sk);
+		if (rc)
+			goto err_sk_put;
+	}
+
+	tlsh_sk(sk)->th_bind_family = AF_UNSPEC;
+	return 0;
+
+err_sk_put:
+	sock_orphan(sk);
+	sk_free(sk);	/* Ref: A (err) */
+	return rc;
+}
+
+/**
+ * tls_client_hello_x509 - request an x.509-based TLS handshake on a socket
+ * @sock: connected socket on which to perform the handshake
+ * @done: function to call when the handshake has completed
+ * @data: token to pass back to @done
+ * @priorities: GnuTLS TLS priorities string
+ * @peerid: serial number of key containing private key
+ * @cert: serial number of key containing client's x.509 certificate
+ *
+ * Return values:
+ *   %0: Handshake request enqueued; @done will be called when complete
+ *   %-ENOENT: No user agent is available
+ *   %-ENOMEM: Memory allocation failed
+ */
+int tls_client_hello_x509(struct socket *sock, void (*done)(void *data, int status),
+			  void *data, const char *priorities, key_serial_t peerid,
+			  key_serial_t cert)
+{
+	struct sock *listener, *sk = sock->sk;
+	struct tlsh_sock_info *info;
+	int rc;
+
+	listener = tlsh_find_listener(sock_net(sk), sk->sk_family);
+	if (!listener)
+		return -ENOENT;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info) {
+		sock_put(listener);	/* Ref: C (err) */
+		return -ENOMEM;
+	}
+
+	info->tsi_handshake_done = done;
+	info->tsi_handshake_data = data;
+	if (priorities && strlen(priorities)) {
+		info->tsi_tls_priorities = kstrdup(priorities, GFP_KERNEL);
+		if (!info->tsi_tls_priorities) {
+			tlsh_sock_info_destroy(info);
+			sock_put(listener);	/* Ref: C (err) */
+			return -ENOMEM;
+		}
+	}
+	info->tsi_peerid = peerid;
+	info->tsi_certificate = cert;
+	info->tsi_handshake_type = TLSH_TYPE_CLIENTHELLO_X509;
+	tlsh_sock_save(sk, info);
+
+	rc = tlsh_accept_enqueue(listener, sk);
+	if (rc) {
+		tlsh_sock_clear(sk);
+		sock_put(listener);	/* Ref: C (err) */
+	}
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(tls_client_hello_x509);
+
+/**
+ * tls_client_hello_psk - request a PSK-based TLS handshake on a socket
+ * @sock: connected socket on which to perform the handshake
+ * @done: function to call when the handshake has completed
+ * @data: token to pass back to @done
+ * @priorities: GnuTLS TLS priorities string
+ * @peerid: serial number of key containing TLS identity
+ *
+ * Return values:
+ *   %0: Handshake request enqueued; @done will be called when complete
+ *   %-ENOENT: No user agent is available
+ *   %-ENOMEM: Memory allocation failed
+ */
+int tls_client_hello_psk(struct socket *sock, void (*done)(void *data, int status),
+			 void *data, const char *priorities, key_serial_t peerid)
+{
+	struct sock *listener, *sk = sock->sk;
+	struct tlsh_sock_info *info;
+	int rc;
+
+	listener = tlsh_find_listener(sock_net(sk), sk->sk_family);
+	if (!listener)
+		return -ENOENT;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info) {
+		sock_put(listener);	/* Ref: C (err) */
+		return -ENOMEM;
+	}
+
+	info->tsi_handshake_done = done;
+	info->tsi_handshake_data = data;
+	if (priorities && strlen(priorities)) {
+		info->tsi_tls_priorities = kstrdup(priorities, GFP_KERNEL);
+		if (!info->tsi_tls_priorities) {
+			tlsh_sock_info_destroy(info);
+			sock_put(listener);	/* Ref: C (err) */
+			return -ENOMEM;
+		}
+	}
+	info->tsi_peerid = peerid;
+	info->tsi_handshake_type = TLSH_TYPE_CLIENTHELLO_PSK;
+	tlsh_sock_save(sk, info);
+
+	rc = tlsh_accept_enqueue(listener, sk);
+	if (rc) {
+		tlsh_sock_clear(sk);
+		sock_put(listener);	/* Ref: C (err) */
+	}
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(tls_client_hello_psk);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 7eca4d9a83c4..c5e0a7b3aa2e 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -49,6 +49,7 @@  MODULE_AUTHOR("Mellanox Technologies");
 MODULE_DESCRIPTION("Transport Layer Security Support");
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_ALIAS_TCP_ULP("tls");
+MODULE_ALIAS_NETPROTO(PF_TLSH);
 
 enum {
 	TLSV4,
@@ -982,6 +983,12 @@  static struct tcp_ulp_ops tcp_tls_ulp_ops __read_mostly = {
 	.get_info_size		= tls_get_info_size,
 };
 
+static const struct net_proto_family tlsh_pf_ops = {
+	.family = PF_TLSH,
+	.create = tlsh_pf_create,
+	.owner	= THIS_MODULE,
+};
+
 static int __init tls_register(void)
 {
 	int err;
@@ -993,11 +1000,14 @@  static int __init tls_register(void)
 	tls_device_init();
 	tcp_register_ulp(&tcp_tls_ulp_ops);
 
+	sock_register(&tlsh_pf_ops);
+
 	return 0;
 }
 
 static void __exit tls_unregister(void)
 {
+	sock_unregister(PF_TLSH);
 	tcp_unregister_ulp(&tcp_tls_ulp_ops);
 	tls_device_cleanup();
 	unregister_pernet_subsys(&tls_proc_ops);