diff mbox series

[RFC,bpf-next] xsk: honor SO_BINDTODEVICE on bind

Message ID 20230630145831.2988845-1-i.maximets@ovn.org (mailing list archive)
State Superseded
Delegated to: BPF
Headers show
Series [RFC,bpf-next] xsk: honor SO_BINDTODEVICE on bind | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 10 this patch: 10
netdev/cc_maintainers warning 6 maintainers not CCed: hawk@kernel.org daniel@iogearbox.net john.fastabend@gmail.com corbet@lwn.net jonathan.lemon@gmail.com ast@kernel.org
netdev/build_clang fail Errors and warnings before: 31 this patch: 18
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 10 this patch: 10
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 33 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain_full }}
bpf/vmtest-bpf-next-VM_Test-2 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-5 fail Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-7 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-8 success Logs for veristat

Commit Message

Ilya Maximets June 30, 2023, 2:58 p.m. UTC
Initial creation of an AF_XDP socket requires CAP_NET_RAW capability.
A privileged process might create the socket and pass it to a
non-privileged process for later use.  However, that process will be
able to bind the socket to any network interface.  Even though it will
not be able to receive any traffic without modification of the BPF map,
the situation is not ideal.

Sockets already have a mechanism that can be used to restrict what
interface they can be attached to.  That is SO_BINDTODEVICE.

To change the binding the process will need CAP_NET_RAW.

Make xsk_bind() honor the SO_BINDTODEVICE in order to allow safer
workflow when non-privileged process is using AF_XDP.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---

Posting as an RFC for now to probably get some feedback.
Will re-post once the tree is open.

 Documentation/networking/af_xdp.rst | 9 +++++++++
 net/xdp/xsk.c                       | 6 ++++++
 2 files changed, 15 insertions(+)

Comments

Magnus Karlsson July 3, 2023, 9:48 a.m. UTC | #1
On Fri, 30 Jun 2023 at 16:58, Ilya Maximets <i.maximets@ovn.org> wrote:
>
> Initial creation of an AF_XDP socket requires CAP_NET_RAW capability.
> A privileged process might create the socket and pass it to a
> non-privileged process for later use.  However, that process will be
> able to bind the socket to any network interface.  Even though it will
> not be able to receive any traffic without modification of the BPF map,
> the situation is not ideal.
>
> Sockets already have a mechanism that can be used to restrict what
> interface they can be attached to.  That is SO_BINDTODEVICE.
>
> To change the binding the process will need CAP_NET_RAW.
>
> Make xsk_bind() honor the SO_BINDTODEVICE in order to allow safer
> workflow when non-privileged process is using AF_XDP.

Rebinding an AF_XDP socket is not allowed today. Any such attempt will
return an error from bind. So if I understand the purpose of
SO_BINDTODEVICE correctly, you could say that this option is always
set for an AF_XDP socket and it is not possible to toggle it. The only
way to "rebind" an AF_XDP socket is to close it and open a new one.
This was a conscious design decision from day one as it would be very
hard to support this, especially in zero-copy mode.

> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---
>
> Posting as an RFC for now to probably get some feedback.
> Will re-post once the tree is open.
>
>  Documentation/networking/af_xdp.rst | 9 +++++++++
>  net/xdp/xsk.c                       | 6 ++++++
>  2 files changed, 15 insertions(+)
>
> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> index 247c6c4127e9..1cc35de336a4 100644
> --- a/Documentation/networking/af_xdp.rst
> +++ b/Documentation/networking/af_xdp.rst
> @@ -433,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the
>  application to use. The final option is the flags field, but it will
>  be dealt with in separate sections for each UMEM flag.
>
> +SO_BINDTODEVICE setsockopt
> +--------------------------
> +
> +This is a generic SOL_SOCKET option that can be used to tie AF_XDP
> +socket to a particular network interface.  It is useful when a socket
> +is created by a privileged process and passed to a non-privileged one.
> +Once the option is set, kernel will refuse attempts to bind that socket
> +to a different interface.  Updating the value requires CAP_NET_RAW.
> +
>  XDP_STATISTICS getsockopt
>  -------------------------
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 5a8c0dd250af..386ff641db0f 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -886,6 +886,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>         struct sock *sk = sock->sk;
>         struct xdp_sock *xs = xdp_sk(sk);
>         struct net_device *dev;
> +       int bound_dev_if;
>         u32 flags, qid;
>         int err = 0;
>
> @@ -899,6 +900,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>                       XDP_USE_NEED_WAKEUP))
>                 return -EINVAL;
>
> +       bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
> +
> +       if (bound_dev_if && bound_dev_if != sxdp->sxdp_ifindex)
> +               return -EINVAL;
> +
>         rtnl_lock();
>         mutex_lock(&xs->mutex);
>         if (xs->state != XSK_READY) {
> --
> 2.40.1
>
>
Ilya Maximets July 3, 2023, 10:06 a.m. UTC | #2
On 7/3/23 11:48, Magnus Karlsson wrote:
> On Fri, 30 Jun 2023 at 16:58, Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> Initial creation of an AF_XDP socket requires CAP_NET_RAW capability.
>> A privileged process might create the socket and pass it to a
>> non-privileged process for later use.  However, that process will be
>> able to bind the socket to any network interface.  Even though it will
>> not be able to receive any traffic without modification of the BPF map,
>> the situation is not ideal.
>>
>> Sockets already have a mechanism that can be used to restrict what
>> interface they can be attached to.  That is SO_BINDTODEVICE.
>>
>> To change the binding the process will need CAP_NET_RAW.
>>
>> Make xsk_bind() honor the SO_BINDTODEVICE in order to allow safer
>> workflow when non-privileged process is using AF_XDP.
> 
> Rebinding an AF_XDP socket is not allowed today. Any such attempt will
> return an error from bind. So if I understand the purpose of
> SO_BINDTODEVICE correctly, you could say that this option is always
> set for an AF_XDP socket and it is not possible to toggle it. The only
> way to "rebind" an AF_XDP socket is to close it and open a new one.
> This was a conscious design decision from day one as it would be very
> hard to support this, especially in zero-copy mode.

Hi, Magnus.

The purpose of this patch is not to allow re-binding.  The use case is
following:

1. First process creates a bare socket with socket(AF_XDP, ...).
2. First process loads the XSK program to the interface.
3. First process adds the socket fd to a BPF map.
4. First process sends socket fd to a second process.
5. Second process allocates UMEM.
6. Second process binds socket to the interface.

The idea is that the first process will call SO_BINDTODEVICE before
sending socket fd to a second process, so the second process is limited
in to which interface it can bind the socket.

Does that make sense?

This workflow allows the second process to have no capabilities
as long as it has sufficient RLIMIT_MEMLOCK.

Best regards, Ilya Maximets.

> 
>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>> ---
>>
>> Posting as an RFC for now to probably get some feedback.
>> Will re-post once the tree is open.
>>
>>  Documentation/networking/af_xdp.rst | 9 +++++++++
>>  net/xdp/xsk.c                       | 6 ++++++
>>  2 files changed, 15 insertions(+)
>>
>> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
>> index 247c6c4127e9..1cc35de336a4 100644
>> --- a/Documentation/networking/af_xdp.rst
>> +++ b/Documentation/networking/af_xdp.rst
>> @@ -433,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the
>>  application to use. The final option is the flags field, but it will
>>  be dealt with in separate sections for each UMEM flag.
>>
>> +SO_BINDTODEVICE setsockopt
>> +--------------------------
>> +
>> +This is a generic SOL_SOCKET option that can be used to tie AF_XDP
>> +socket to a particular network interface.  It is useful when a socket
>> +is created by a privileged process and passed to a non-privileged one.
>> +Once the option is set, kernel will refuse attempts to bind that socket
>> +to a different interface.  Updating the value requires CAP_NET_RAW.
>> +
>>  XDP_STATISTICS getsockopt
>>  -------------------------
>>
>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>> index 5a8c0dd250af..386ff641db0f 100644
>> --- a/net/xdp/xsk.c
>> +++ b/net/xdp/xsk.c
>> @@ -886,6 +886,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>>         struct sock *sk = sock->sk;
>>         struct xdp_sock *xs = xdp_sk(sk);
>>         struct net_device *dev;
>> +       int bound_dev_if;
>>         u32 flags, qid;
>>         int err = 0;
>>
>> @@ -899,6 +900,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>>                       XDP_USE_NEED_WAKEUP))
>>                 return -EINVAL;
>>
>> +       bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
>> +
>> +       if (bound_dev_if && bound_dev_if != sxdp->sxdp_ifindex)
>> +               return -EINVAL;
>> +
>>         rtnl_lock();
>>         mutex_lock(&xs->mutex);
>>         if (xs->state != XSK_READY) {
>> --
>> 2.40.1
>>
>>
Ilya Maximets July 3, 2023, 10:14 a.m. UTC | #3
On 7/3/23 12:06, Ilya Maximets wrote:
> On 7/3/23 11:48, Magnus Karlsson wrote:
>> On Fri, 30 Jun 2023 at 16:58, Ilya Maximets <i.maximets@ovn.org> wrote:
>>>
>>> Initial creation of an AF_XDP socket requires CAP_NET_RAW capability.
>>> A privileged process might create the socket and pass it to a
>>> non-privileged process for later use.  However, that process will be
>>> able to bind the socket to any network interface.  Even though it will
>>> not be able to receive any traffic without modification of the BPF map,
>>> the situation is not ideal.
>>>
>>> Sockets already have a mechanism that can be used to restrict what
>>> interface they can be attached to.  That is SO_BINDTODEVICE.
>>>
>>> To change the binding the process will need CAP_NET_RAW.
>>>
>>> Make xsk_bind() honor the SO_BINDTODEVICE in order to allow safer
>>> workflow when non-privileged process is using AF_XDP.
>>
>> Rebinding an AF_XDP socket is not allowed today. Any such attempt will
>> return an error from bind. So if I understand the purpose of
>> SO_BINDTODEVICE correctly, you could say that this option is always
>> set for an AF_XDP socket and it is not possible to toggle it. The only
>> way to "rebind" an AF_XDP socket is to close it and open a new one.
>> This was a conscious design decision from day one as it would be very
>> hard to support this, especially in zero-copy mode.
> 
> Hi, Magnus.
> 
> The purpose of this patch is not to allow re-binding.  The use case is
> following:
> 
> 1. First process creates a bare socket with socket(AF_XDP, ...).
> 2. First process loads the XSK program to the interface.
> 3. First process adds the socket fd to a BPF map.
> 4. First process sends socket fd to a second process.
> 5. Second process allocates UMEM.
> 6. Second process binds socket to the interface.

7. Second process sends/receives the traffic. :)

> 
> The idea is that the first process will call SO_BINDTODEVICE before
> sending socket fd to a second process, so the second process is limited
> in to which interface it can bind the socket.
> 
> Does that make sense?
> 
> This workflow allows the second process to have no capabilities
> as long as it has sufficient RLIMIT_MEMLOCK.

Note that steps 1-7 are working just fine today.  i.e. the umem
registration, bind, ring mapping and traffic send/receive do not
require any extra capabilities.

We may restrict the bind() call to require CAP_NET_RAW and then
allow it for sockets that had SO_BINDTODEVICE as an alternative.
But restriction will break the current uAPI.

> 
> Best regards, Ilya Maximets.
> 
>>
>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>> ---
>>>
>>> Posting as an RFC for now to probably get some feedback.
>>> Will re-post once the tree is open.
>>>
>>>  Documentation/networking/af_xdp.rst | 9 +++++++++
>>>  net/xdp/xsk.c                       | 6 ++++++
>>>  2 files changed, 15 insertions(+)
>>>
>>> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
>>> index 247c6c4127e9..1cc35de336a4 100644
>>> --- a/Documentation/networking/af_xdp.rst
>>> +++ b/Documentation/networking/af_xdp.rst
>>> @@ -433,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the
>>>  application to use. The final option is the flags field, but it will
>>>  be dealt with in separate sections for each UMEM flag.
>>>
>>> +SO_BINDTODEVICE setsockopt
>>> +--------------------------
>>> +
>>> +This is a generic SOL_SOCKET option that can be used to tie AF_XDP
>>> +socket to a particular network interface.  It is useful when a socket
>>> +is created by a privileged process and passed to a non-privileged one.
>>> +Once the option is set, kernel will refuse attempts to bind that socket
>>> +to a different interface.  Updating the value requires CAP_NET_RAW.
>>> +
>>>  XDP_STATISTICS getsockopt
>>>  -------------------------
>>>
>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>> index 5a8c0dd250af..386ff641db0f 100644
>>> --- a/net/xdp/xsk.c
>>> +++ b/net/xdp/xsk.c
>>> @@ -886,6 +886,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>>>         struct sock *sk = sock->sk;
>>>         struct xdp_sock *xs = xdp_sk(sk);
>>>         struct net_device *dev;
>>> +       int bound_dev_if;
>>>         u32 flags, qid;
>>>         int err = 0;
>>>
>>> @@ -899,6 +900,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>>>                       XDP_USE_NEED_WAKEUP))
>>>                 return -EINVAL;
>>>
>>> +       bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
>>> +
>>> +       if (bound_dev_if && bound_dev_if != sxdp->sxdp_ifindex)
>>> +               return -EINVAL;
>>> +
>>>         rtnl_lock();
>>>         mutex_lock(&xs->mutex);
>>>         if (xs->state != XSK_READY) {
>>> --
>>> 2.40.1
>>>
>>>
>
Magnus Karlsson July 3, 2023, 10:24 a.m. UTC | #4
On Mon, 3 Jul 2023 at 12:13, Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 7/3/23 12:06, Ilya Maximets wrote:
> > On 7/3/23 11:48, Magnus Karlsson wrote:
> >> On Fri, 30 Jun 2023 at 16:58, Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>
> >>> Initial creation of an AF_XDP socket requires CAP_NET_RAW capability.
> >>> A privileged process might create the socket and pass it to a
> >>> non-privileged process for later use.  However, that process will be
> >>> able to bind the socket to any network interface.  Even though it will
> >>> not be able to receive any traffic without modification of the BPF map,
> >>> the situation is not ideal.
> >>>
> >>> Sockets already have a mechanism that can be used to restrict what
> >>> interface they can be attached to.  That is SO_BINDTODEVICE.
> >>>
> >>> To change the binding the process will need CAP_NET_RAW.
> >>>
> >>> Make xsk_bind() honor the SO_BINDTODEVICE in order to allow safer
> >>> workflow when non-privileged process is using AF_XDP.
> >>
> >> Rebinding an AF_XDP socket is not allowed today. Any such attempt will
> >> return an error from bind. So if I understand the purpose of
> >> SO_BINDTODEVICE correctly, you could say that this option is always
> >> set for an AF_XDP socket and it is not possible to toggle it. The only
> >> way to "rebind" an AF_XDP socket is to close it and open a new one.
> >> This was a conscious design decision from day one as it would be very
> >> hard to support this, especially in zero-copy mode.
> >
> > Hi, Magnus.
> >
> > The purpose of this patch is not to allow re-binding.  The use case is
> > following:
> >
> > 1. First process creates a bare socket with socket(AF_XDP, ...).
> > 2. First process loads the XSK program to the interface.
> > 3. First process adds the socket fd to a BPF map.
> > 4. First process sends socket fd to a second process.
> > 5. Second process allocates UMEM.
> > 6. Second process binds socket to the interface.
>
> 7. Second process sends/receives the traffic. :)
>
> >
> > The idea is that the first process will call SO_BINDTODEVICE before
> > sending socket fd to a second process, so the second process is limited
> > in to which interface it can bind the socket.
> >
> > Does that make sense?

Thanks for explaining this to me. Yes, that makes sense and seems
useful. Could you please send a v2 and include the flow (1-7) above in
your commit message? Would be good to add one step with the setsockopt
SO_BINDTODEVICE before step #4 just to be clear. With those changes
please feel free to include my ack:

 Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>

Thank you!

> > This workflow allows the second process to have no capabilities
> > as long as it has sufficient RLIMIT_MEMLOCK.
>
> Note that steps 1-7 are working just fine today.  i.e. the umem
> registration, bind, ring mapping and traffic send/receive do not
> require any extra capabilities.
>
> We may restrict the bind() call to require CAP_NET_RAW and then
> allow it for sockets that had SO_BINDTODEVICE as an alternative.
> But restriction will break the current uAPI.
>
> >
> > Best regards, Ilya Maximets.
> >
> >>
> >>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> >>> ---
> >>>
> >>> Posting as an RFC for now to probably get some feedback.
> >>> Will re-post once the tree is open.
> >>>
> >>>  Documentation/networking/af_xdp.rst | 9 +++++++++
> >>>  net/xdp/xsk.c                       | 6 ++++++
> >>>  2 files changed, 15 insertions(+)
> >>>
> >>> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> >>> index 247c6c4127e9..1cc35de336a4 100644
> >>> --- a/Documentation/networking/af_xdp.rst
> >>> +++ b/Documentation/networking/af_xdp.rst
> >>> @@ -433,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the
> >>>  application to use. The final option is the flags field, but it will
> >>>  be dealt with in separate sections for each UMEM flag.
> >>>
> >>> +SO_BINDTODEVICE setsockopt
> >>> +--------------------------
> >>> +
> >>> +This is a generic SOL_SOCKET option that can be used to tie AF_XDP
> >>> +socket to a particular network interface.  It is useful when a socket
> >>> +is created by a privileged process and passed to a non-privileged one.
> >>> +Once the option is set, kernel will refuse attempts to bind that socket
> >>> +to a different interface.  Updating the value requires CAP_NET_RAW.
> >>> +
> >>>  XDP_STATISTICS getsockopt
> >>>  -------------------------
> >>>
> >>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> >>> index 5a8c0dd250af..386ff641db0f 100644
> >>> --- a/net/xdp/xsk.c
> >>> +++ b/net/xdp/xsk.c
> >>> @@ -886,6 +886,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
> >>>         struct sock *sk = sock->sk;
> >>>         struct xdp_sock *xs = xdp_sk(sk);
> >>>         struct net_device *dev;
> >>> +       int bound_dev_if;
> >>>         u32 flags, qid;
> >>>         int err = 0;
> >>>
> >>> @@ -899,6 +900,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
> >>>                       XDP_USE_NEED_WAKEUP))
> >>>                 return -EINVAL;
> >>>
> >>> +       bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
> >>> +
> >>> +       if (bound_dev_if && bound_dev_if != sxdp->sxdp_ifindex)
> >>> +               return -EINVAL;
> >>> +
> >>>         rtnl_lock();
> >>>         mutex_lock(&xs->mutex);
> >>>         if (xs->state != XSK_READY) {
> >>> --
> >>> 2.40.1
> >>>
> >>>
> >
>
Ilya Maximets July 3, 2023, 11:17 a.m. UTC | #5
On 7/3/23 12:24, Magnus Karlsson wrote:
> On Mon, 3 Jul 2023 at 12:13, Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 7/3/23 12:06, Ilya Maximets wrote:
>>> On 7/3/23 11:48, Magnus Karlsson wrote:
>>>> On Fri, 30 Jun 2023 at 16:58, Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>
>>>>> Initial creation of an AF_XDP socket requires CAP_NET_RAW capability.
>>>>> A privileged process might create the socket and pass it to a
>>>>> non-privileged process for later use.  However, that process will be
>>>>> able to bind the socket to any network interface.  Even though it will
>>>>> not be able to receive any traffic without modification of the BPF map,
>>>>> the situation is not ideal.
>>>>>
>>>>> Sockets already have a mechanism that can be used to restrict what
>>>>> interface they can be attached to.  That is SO_BINDTODEVICE.
>>>>>
>>>>> To change the binding the process will need CAP_NET_RAW.
>>>>>
>>>>> Make xsk_bind() honor the SO_BINDTODEVICE in order to allow safer
>>>>> workflow when non-privileged process is using AF_XDP.
>>>>
>>>> Rebinding an AF_XDP socket is not allowed today. Any such attempt will
>>>> return an error from bind. So if I understand the purpose of
>>>> SO_BINDTODEVICE correctly, you could say that this option is always
>>>> set for an AF_XDP socket and it is not possible to toggle it. The only
>>>> way to "rebind" an AF_XDP socket is to close it and open a new one.
>>>> This was a conscious design decision from day one as it would be very
>>>> hard to support this, especially in zero-copy mode.
>>>
>>> Hi, Magnus.
>>>
>>> The purpose of this patch is not to allow re-binding.  The use case is
>>> following:
>>>
>>> 1. First process creates a bare socket with socket(AF_XDP, ...).
>>> 2. First process loads the XSK program to the interface.
>>> 3. First process adds the socket fd to a BPF map.
>>> 4. First process sends socket fd to a second process.
>>> 5. Second process allocates UMEM.
>>> 6. Second process binds socket to the interface.
>>
>> 7. Second process sends/receives the traffic. :)
>>
>>>
>>> The idea is that the first process will call SO_BINDTODEVICE before
>>> sending socket fd to a second process, so the second process is limited
>>> in to which interface it can bind the socket.
>>>
>>> Does that make sense?
> 
> Thanks for explaining this to me. Yes, that makes sense and seems
> useful. Could you please send a v2 and include the flow (1-7) above in
> your commit message? Would be good to add one step with the setsockopt
> SO_BINDTODEVICE before step #4 just to be clear. With those changes
> please feel free to include my ack:
> 
>  Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>

Thanks!  I'll update the commit message with the steps above to make it
more clear.

I was planning to send a non-RFC version of this patch once the tree is
open (in a week).  Or are the rules for bpf-next different?

> 
> Thank you!
> 
>>> This workflow allows the second process to have no capabilities
>>> as long as it has sufficient RLIMIT_MEMLOCK.
>>
>> Note that steps 1-7 are working just fine today.  i.e. the umem
>> registration, bind, ring mapping and traffic send/receive do not
>> require any extra capabilities.
>>
>> We may restrict the bind() call to require CAP_NET_RAW and then
>> allow it for sockets that had SO_BINDTODEVICE as an alternative.
>> But restriction will break the current uAPI.
>>
>>>
>>> Best regards, Ilya Maximets.
>>>
>>>>
>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>>>> ---
>>>>>
>>>>> Posting as an RFC for now to probably get some feedback.
>>>>> Will re-post once the tree is open.
>>>>>
>>>>>  Documentation/networking/af_xdp.rst | 9 +++++++++
>>>>>  net/xdp/xsk.c                       | 6 ++++++
>>>>>  2 files changed, 15 insertions(+)
>>>>>
>>>>> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
>>>>> index 247c6c4127e9..1cc35de336a4 100644
>>>>> --- a/Documentation/networking/af_xdp.rst
>>>>> +++ b/Documentation/networking/af_xdp.rst
>>>>> @@ -433,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the
>>>>>  application to use. The final option is the flags field, but it will
>>>>>  be dealt with in separate sections for each UMEM flag.
>>>>>
>>>>> +SO_BINDTODEVICE setsockopt
>>>>> +--------------------------
>>>>> +
>>>>> +This is a generic SOL_SOCKET option that can be used to tie AF_XDP
>>>>> +socket to a particular network interface.  It is useful when a socket
>>>>> +is created by a privileged process and passed to a non-privileged one.
>>>>> +Once the option is set, kernel will refuse attempts to bind that socket
>>>>> +to a different interface.  Updating the value requires CAP_NET_RAW.
>>>>> +
>>>>>  XDP_STATISTICS getsockopt
>>>>>  -------------------------
>>>>>
>>>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>>>> index 5a8c0dd250af..386ff641db0f 100644
>>>>> --- a/net/xdp/xsk.c
>>>>> +++ b/net/xdp/xsk.c
>>>>> @@ -886,6 +886,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>>>>>         struct sock *sk = sock->sk;
>>>>>         struct xdp_sock *xs = xdp_sk(sk);
>>>>>         struct net_device *dev;
>>>>> +       int bound_dev_if;
>>>>>         u32 flags, qid;
>>>>>         int err = 0;
>>>>>
>>>>> @@ -899,6 +900,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>>>>>                       XDP_USE_NEED_WAKEUP))
>>>>>                 return -EINVAL;
>>>>>
>>>>> +       bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
>>>>> +
>>>>> +       if (bound_dev_if && bound_dev_if != sxdp->sxdp_ifindex)
>>>>> +               return -EINVAL;
>>>>> +
>>>>>         rtnl_lock();
>>>>>         mutex_lock(&xs->mutex);
>>>>>         if (xs->state != XSK_READY) {
>>>>> --
>>>>> 2.40.1
>>>>>
>>>>>
>>>
>>
Magnus Karlsson July 3, 2023, 11:19 a.m. UTC | #6
On Mon, 3 Jul 2023 at 13:16, Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 7/3/23 12:24, Magnus Karlsson wrote:
> > On Mon, 3 Jul 2023 at 12:13, Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >> On 7/3/23 12:06, Ilya Maximets wrote:
> >>> On 7/3/23 11:48, Magnus Karlsson wrote:
> >>>> On Fri, 30 Jun 2023 at 16:58, Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>>
> >>>>> Initial creation of an AF_XDP socket requires CAP_NET_RAW capability.
> >>>>> A privileged process might create the socket and pass it to a
> >>>>> non-privileged process for later use.  However, that process will be
> >>>>> able to bind the socket to any network interface.  Even though it will
> >>>>> not be able to receive any traffic without modification of the BPF map,
> >>>>> the situation is not ideal.
> >>>>>
> >>>>> Sockets already have a mechanism that can be used to restrict what
> >>>>> interface they can be attached to.  That is SO_BINDTODEVICE.
> >>>>>
> >>>>> To change the binding the process will need CAP_NET_RAW.
> >>>>>
> >>>>> Make xsk_bind() honor the SO_BINDTODEVICE in order to allow safer
> >>>>> workflow when non-privileged process is using AF_XDP.
> >>>>
> >>>> Rebinding an AF_XDP socket is not allowed today. Any such attempt will
> >>>> return an error from bind. So if I understand the purpose of
> >>>> SO_BINDTODEVICE correctly, you could say that this option is always
> >>>> set for an AF_XDP socket and it is not possible to toggle it. The only
> >>>> way to "rebind" an AF_XDP socket is to close it and open a new one.
> >>>> This was a conscious design decision from day one as it would be very
> >>>> hard to support this, especially in zero-copy mode.
> >>>
> >>> Hi, Magnus.
> >>>
> >>> The purpose of this patch is not to allow re-binding.  The use case is
> >>> following:
> >>>
> >>> 1. First process creates a bare socket with socket(AF_XDP, ...).
> >>> 2. First process loads the XSK program to the interface.
> >>> 3. First process adds the socket fd to a BPF map.
> >>> 4. First process sends socket fd to a second process.
> >>> 5. Second process allocates UMEM.
> >>> 6. Second process binds socket to the interface.
> >>
> >> 7. Second process sends/receives the traffic. :)
> >>
> >>>
> >>> The idea is that the first process will call SO_BINDTODEVICE before
> >>> sending socket fd to a second process, so the second process is limited
> >>> in to which interface it can bind the socket.
> >>>
> >>> Does that make sense?
> >
> > Thanks for explaining this to me. Yes, that makes sense and seems
> > useful. Could you please send a v2 and include the flow (1-7) above in
> > your commit message? Would be good to add one step with the setsockopt
> > SO_BINDTODEVICE before step #4 just to be clear. With those changes
> > please feel free to include my ack:
> >
> >  Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
>
> Thanks!  I'll update the commit message with the steps above to make it
> more clear.
>
> I was planning to send a non-RFC version of this patch once the tree is
> open (in a week).  Or are the rules for bpf-next different?

Bpf-next is always open I believe.

> >
> > Thank you!
> >
> >>> This workflow allows the second process to have no capabilities
> >>> as long as it has sufficient RLIMIT_MEMLOCK.
> >>
> >> Note that steps 1-7 are working just fine today.  i.e. the umem
> >> registration, bind, ring mapping and traffic send/receive do not
> >> require any extra capabilities.
> >>
> >> We may restrict the bind() call to require CAP_NET_RAW and then
> >> allow it for sockets that had SO_BINDTODEVICE as an alternative.
> >> But restriction will break the current uAPI.
> >>
> >>>
> >>> Best regards, Ilya Maximets.
> >>>
> >>>>
> >>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> >>>>> ---
> >>>>>
> >>>>> Posting as an RFC for now to probably get some feedback.
> >>>>> Will re-post once the tree is open.
> >>>>>
> >>>>>  Documentation/networking/af_xdp.rst | 9 +++++++++
> >>>>>  net/xdp/xsk.c                       | 6 ++++++
> >>>>>  2 files changed, 15 insertions(+)
> >>>>>
> >>>>> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> >>>>> index 247c6c4127e9..1cc35de336a4 100644
> >>>>> --- a/Documentation/networking/af_xdp.rst
> >>>>> +++ b/Documentation/networking/af_xdp.rst
> >>>>> @@ -433,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the
> >>>>>  application to use. The final option is the flags field, but it will
> >>>>>  be dealt with in separate sections for each UMEM flag.
> >>>>>
> >>>>> +SO_BINDTODEVICE setsockopt
> >>>>> +--------------------------
> >>>>> +
> >>>>> +This is a generic SOL_SOCKET option that can be used to tie AF_XDP
> >>>>> +socket to a particular network interface.  It is useful when a socket
> >>>>> +is created by a privileged process and passed to a non-privileged one.
> >>>>> +Once the option is set, kernel will refuse attempts to bind that socket
> >>>>> +to a different interface.  Updating the value requires CAP_NET_RAW.
> >>>>> +
> >>>>>  XDP_STATISTICS getsockopt
> >>>>>  -------------------------
> >>>>>
> >>>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> >>>>> index 5a8c0dd250af..386ff641db0f 100644
> >>>>> --- a/net/xdp/xsk.c
> >>>>> +++ b/net/xdp/xsk.c
> >>>>> @@ -886,6 +886,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
> >>>>>         struct sock *sk = sock->sk;
> >>>>>         struct xdp_sock *xs = xdp_sk(sk);
> >>>>>         struct net_device *dev;
> >>>>> +       int bound_dev_if;
> >>>>>         u32 flags, qid;
> >>>>>         int err = 0;
> >>>>>
> >>>>> @@ -899,6 +900,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
> >>>>>                       XDP_USE_NEED_WAKEUP))
> >>>>>                 return -EINVAL;
> >>>>>
> >>>>> +       bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
> >>>>> +
> >>>>> +       if (bound_dev_if && bound_dev_if != sxdp->sxdp_ifindex)
> >>>>> +               return -EINVAL;
> >>>>> +
> >>>>>         rtnl_lock();
> >>>>>         mutex_lock(&xs->mutex);
> >>>>>         if (xs->state != XSK_READY) {
> >>>>> --
> >>>>> 2.40.1
> >>>>>
> >>>>>
> >>>
> >>
>
Ilya Maximets July 3, 2023, 5:30 p.m. UTC | #7
On 7/3/23 13:19, Magnus Karlsson wrote:
> On Mon, 3 Jul 2023 at 13:16, Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 7/3/23 12:24, Magnus Karlsson wrote:
>>> On Mon, 3 Jul 2023 at 12:13, Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>
>>>> On 7/3/23 12:06, Ilya Maximets wrote:
>>>>> On 7/3/23 11:48, Magnus Karlsson wrote:
>>>>>> On Fri, 30 Jun 2023 at 16:58, Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>>>>
>>>>>>> Initial creation of an AF_XDP socket requires CAP_NET_RAW capability.
>>>>>>> A privileged process might create the socket and pass it to a
>>>>>>> non-privileged process for later use.  However, that process will be
>>>>>>> able to bind the socket to any network interface.  Even though it will
>>>>>>> not be able to receive any traffic without modification of the BPF map,
>>>>>>> the situation is not ideal.
>>>>>>>
>>>>>>> Sockets already have a mechanism that can be used to restrict what
>>>>>>> interface they can be attached to.  That is SO_BINDTODEVICE.
>>>>>>>
>>>>>>> To change the binding the process will need CAP_NET_RAW.
>>>>>>>
>>>>>>> Make xsk_bind() honor the SO_BINDTODEVICE in order to allow safer
>>>>>>> workflow when non-privileged process is using AF_XDP.
>>>>>>
>>>>>> Rebinding an AF_XDP socket is not allowed today. Any such attempt will
>>>>>> return an error from bind. So if I understand the purpose of
>>>>>> SO_BINDTODEVICE correctly, you could say that this option is always
>>>>>> set for an AF_XDP socket and it is not possible to toggle it. The only
>>>>>> way to "rebind" an AF_XDP socket is to close it and open a new one.
>>>>>> This was a conscious design decision from day one as it would be very
>>>>>> hard to support this, especially in zero-copy mode.
>>>>>
>>>>> Hi, Magnus.
>>>>>
>>>>> The purpose of this patch is not to allow re-binding.  The use case is
>>>>> following:
>>>>>
>>>>> 1. First process creates a bare socket with socket(AF_XDP, ...).
>>>>> 2. First process loads the XSK program to the interface.
>>>>> 3. First process adds the socket fd to a BPF map.
>>>>> 4. First process sends socket fd to a second process.
>>>>> 5. Second process allocates UMEM.
>>>>> 6. Second process binds socket to the interface.
>>>>
>>>> 7. Second process sends/receives the traffic. :)
>>>>
>>>>>
>>>>> The idea is that the first process will call SO_BINDTODEVICE before
>>>>> sending socket fd to a second process, so the second process is limited
>>>>> in to which interface it can bind the socket.
>>>>>
>>>>> Does that make sense?
>>>
>>> Thanks for explaining this to me. Yes, that makes sense and seems
>>> useful. Could you please send a v2 and include the flow (1-7) above in
>>> your commit message? Would be good to add one step with the setsockopt
>>> SO_BINDTODEVICE before step #4 just to be clear. With those changes
>>> please feel free to include my ack:
>>>
>>>  Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
>>
>> Thanks!  I'll update the commit message with the steps above to make it
>> more clear.
>>
>> I was planning to send a non-RFC version of this patch once the tree is
>> open (in a week).  Or are the rules for bpf-next different?
> 
> Bpf-next is always open I believe.

OK.  I'll update the commit message and send a formal patch then.  Thanks!

> 
>>>
>>> Thank you!
>>>
>>>>> This workflow allows the second process to have no capabilities
>>>>> as long as it has sufficient RLIMIT_MEMLOCK.
>>>>
>>>> Note that steps 1-7 are working just fine today.  i.e. the umem
>>>> registration, bind, ring mapping and traffic send/receive do not
>>>> require any extra capabilities.
>>>>
>>>> We may restrict the bind() call to require CAP_NET_RAW and then
>>>> allow it for sockets that had SO_BINDTODEVICE as an alternative.
>>>> But restriction will break the current uAPI.
>>>>
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>>>
>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>>>>>> ---
>>>>>>>
>>>>>>> Posting as an RFC for now to probably get some feedback.
>>>>>>> Will re-post once the tree is open.
>>>>>>>
>>>>>>>  Documentation/networking/af_xdp.rst | 9 +++++++++
>>>>>>>  net/xdp/xsk.c                       | 6 ++++++
>>>>>>>  2 files changed, 15 insertions(+)
>>>>>>>
>>>>>>> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
>>>>>>> index 247c6c4127e9..1cc35de336a4 100644
>>>>>>> --- a/Documentation/networking/af_xdp.rst
>>>>>>> +++ b/Documentation/networking/af_xdp.rst
>>>>>>> @@ -433,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the
>>>>>>>  application to use. The final option is the flags field, but it will
>>>>>>>  be dealt with in separate sections for each UMEM flag.
>>>>>>>
>>>>>>> +SO_BINDTODEVICE setsockopt
>>>>>>> +--------------------------
>>>>>>> +
>>>>>>> +This is a generic SOL_SOCKET option that can be used to tie AF_XDP
>>>>>>> +socket to a particular network interface.  It is useful when a socket
>>>>>>> +is created by a privileged process and passed to a non-privileged one.
>>>>>>> +Once the option is set, kernel will refuse attempts to bind that socket
>>>>>>> +to a different interface.  Updating the value requires CAP_NET_RAW.
>>>>>>> +
>>>>>>>  XDP_STATISTICS getsockopt
>>>>>>>  -------------------------
>>>>>>>
>>>>>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>>>>>> index 5a8c0dd250af..386ff641db0f 100644
>>>>>>> --- a/net/xdp/xsk.c
>>>>>>> +++ b/net/xdp/xsk.c
>>>>>>> @@ -886,6 +886,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>>>>>>>         struct sock *sk = sock->sk;
>>>>>>>         struct xdp_sock *xs = xdp_sk(sk);
>>>>>>>         struct net_device *dev;
>>>>>>> +       int bound_dev_if;
>>>>>>>         u32 flags, qid;
>>>>>>>         int err = 0;
>>>>>>>
>>>>>>> @@ -899,6 +900,11 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>>>>>>>                       XDP_USE_NEED_WAKEUP))
>>>>>>>                 return -EINVAL;
>>>>>>>
>>>>>>> +       bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
>>>>>>> +
>>>>>>> +       if (bound_dev_if && bound_dev_if != sxdp->sxdp_ifindex)
>>>>>>> +               return -EINVAL;
>>>>>>> +
>>>>>>>         rtnl_lock();
>>>>>>>         mutex_lock(&xs->mutex);
>>>>>>>         if (xs->state != XSK_READY) {
>>>>>>> --
>>>>>>> 2.40.1
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>
diff mbox series

Patch

diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 247c6c4127e9..1cc35de336a4 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -433,6 +433,15 @@  start N bytes into the buffer leaving the first N bytes for the
 application to use. The final option is the flags field, but it will
 be dealt with in separate sections for each UMEM flag.
 
+SO_BINDTODEVICE setsockopt
+--------------------------
+
+This is a generic SOL_SOCKET option that can be used to tie AF_XDP
+socket to a particular network interface.  It is useful when a socket
+is created by a privileged process and passed to a non-privileged one.
+Once the option is set, kernel will refuse attempts to bind that socket
+to a different interface.  Updating the value requires CAP_NET_RAW.
+
 XDP_STATISTICS getsockopt
 -------------------------
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 5a8c0dd250af..386ff641db0f 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -886,6 +886,7 @@  static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	struct sock *sk = sock->sk;
 	struct xdp_sock *xs = xdp_sk(sk);
 	struct net_device *dev;
+	int bound_dev_if;
 	u32 flags, qid;
 	int err = 0;
 
@@ -899,6 +900,11 @@  static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		      XDP_USE_NEED_WAKEUP))
 		return -EINVAL;
 
+	bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
+
+	if (bound_dev_if && bound_dev_if != sxdp->sxdp_ifindex)
+		return -EINVAL;
+
 	rtnl_lock();
 	mutex_lock(&xs->mutex);
 	if (xs->state != XSK_READY) {