
[bpf-next,02/14] bpf: net: Avoid sock_setsockopt() taking sk lock when called from bpf

Message ID 20220727060909.2371812-1-kafai@fb.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series: bpf: net: Remove duplicated codes from bpf_setsockopt()

Checks

Context Check Description
bpf/vmtest-bpf-next-PR fail PR summary
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5557 this patch: 5557
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 1179 this patch: 1179
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5695 this patch: 5695
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 64 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1 success Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Kernel LATEST on ubuntu-latest with llvm-15
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for Kernel LATEST on z15 with gcc

Commit Message

Martin KaFai Lau July 27, 2022, 6:09 a.m. UTC
Most of the code in bpf_setsockopt(SOL_SOCKET) is duplicated from
sock_setsockopt().  The number of supported options keeps
increasing, and so does the duplicated code.

One issue with reusing sock_setsockopt() is that the bpf prog
has already acquired the sk lock.  sockptr_t is useful for handling this:
it already has an 'is_kernel' bit to handle the kernel-or-user
memory copy.  This patch adds an 'is_bpf' bit to tell whether sk locking
has already been ensured by the bpf prog.

{lock,release}_sock_sockopt() helpers are added for
sock_setsockopt() to use.  These helpers handle the new
'is_bpf' bit.

Note the change in sock_setbindtodevice(): lock_sock_sockopt()
is now done in sock_setbindtodevice() instead of taking the lock
in sock_bindtoindex(..., lock_sk = true).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/sockptr.h |  8 +++++++-
 include/net/sock.h      | 12 ++++++++++++
 net/core/sock.c         |  8 +++++---
 3 files changed, 24 insertions(+), 4 deletions(-)
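
[Editorial note: for context, a minimal sketch of how a later patch in the
series could hand the already-locked sk and kernel optval to
sock_setsockopt() via the new KERNEL_SOCKPTR_BPF().  This is not part of
this patch and the helper name is illustrative only.]

/* Illustrative sketch only -- not part of this patch.  The bpf prog has
 * already taken the sk lock, so the kernel optval is wrapped with
 * KERNEL_SOCKPTR_BPF(); lock_sock_sockopt()/release_sock_sockopt() then
 * become no-ops inside sock_setsockopt().
 */
static int bpf_sol_socket_setsockopt(struct sock *sk, int optname,
				     char *optval, int optlen)
{
	return sock_setsockopt(sk, SOL_SOCKET, optname,
			       KERNEL_SOCKPTR_BPF(optval), optlen);
}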

Comments

David Laight July 27, 2022, 8:36 a.m. UTC | #1
From: Martin KaFai Lau
> Sent: 27 July 2022 07:09
> 
> Most of the codes in bpf_setsockopt(SOL_SOCKET) are duplicated from
> the sock_setsockopt().  The number of supported options are
> increasing ever and so as the duplicated codes.
> 
> One issue in reusing sock_setsockopt() is that the bpf prog
> has already acquired the sk lock.  sockptr_t is useful to handle this.
> sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> memory copy.  This patch adds a 'is_bpf' bit to tell if sk locking
> has already been ensured by the bpf prog.

That is a really horrid place to hide an 'is locked' bit.

You'd be better off splitting sock_setsockopt() to add a function
that is called with sk_lock held and the value read.
That would also save the churn of all the callers.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
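
[Editorial note: a rough sketch of the split David suggests, for readers
following along.  The function names and the example option are
hypothetical, not from this series: an inner helper runs with the sk lock
held and the integer value already read, while the existing entry point
keeps the copy-in and the locking.]

/* Hypothetical sketch of the suggested split; not from this series. */
static int sk_setsockopt_locked(struct sock *sk, int optname, int val)
{
	int ret = 0;

	switch (optname) {
	case SO_KEEPALIVE:	/* one example of an int-valued option */
		sock_valbool_flag(sk, SOCK_KEEPOPEN, val ? 1 : 0);
		break;
	default:
		ret = -ENOPROTOOPT;
		break;
	}
	return ret;
}

int sock_setsockopt(struct sock *sk, int level, int optname,
		    sockptr_t optval, unsigned int optlen)
{
	int val, ret;

	if (optlen < sizeof(val))
		return -EINVAL;
	if (copy_from_sockptr(&val, optval, sizeof(val)))
		return -EFAULT;

	lock_sock(sk);
	ret = sk_setsockopt_locked(sk, optname, val);
	release_sock(sk);
	return ret;
}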
Stanislav Fomichev July 27, 2022, 4:47 p.m. UTC | #2
On 07/26, Martin KaFai Lau wrote:
> Most of the codes in bpf_setsockopt(SOL_SOCKET) are duplicated from
> the sock_setsockopt().  The number of supported options are
> increasing ever and so as the duplicated codes.

> One issue in reusing sock_setsockopt() is that the bpf prog
> has already acquired the sk lock.  sockptr_t is useful to handle this.
> sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> memory copy.  This patch adds a 'is_bpf' bit to tell if sk locking
> has already been ensured by the bpf prog.

Why not explicitly call it is_locked/is_unlocked? I'm assuming, at some  
point,
we can have code paths in bpf where the socket has been already locked by
the stack?
Martin KaFai Lau July 27, 2022, 6:37 p.m. UTC | #3
On Wed, Jul 27, 2022 at 09:47:25AM -0700, sdf@google.com wrote:
> On 07/26, Martin KaFai Lau wrote:
> > Most of the codes in bpf_setsockopt(SOL_SOCKET) are duplicated from
> > the sock_setsockopt().  The number of supported options are
> > increasing ever and so as the duplicated codes.
> 
> > One issue in reusing sock_setsockopt() is that the bpf prog
> > has already acquired the sk lock.  sockptr_t is useful to handle this.
> > sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> > memory copy.  This patch adds a 'is_bpf' bit to tell if sk locking
> > has already been ensured by the bpf prog.
> 
> Why not explicitly call it is_locked/is_unlocked? I'm assuming, at some
> point,
is_locked was my initial attempt.  The bpf_setsockopt() also skips
the ns_capable() check, like in patch 3.  I ended up using
one is_bpf bit here to do both.

> we can have code paths in bpf where the socket has been already locked by
> the stack?
hmm... You meant the opposite, like the bpf hook does not have the
lock pre-acquired before the bpf prog gets run and sock_setsockopt()
should do lock_sock() as usual?

I was thinking a likely situation is a bpf 'sleepable' hook does not
have the lock pre-acquired.  In that case, the bpf_setsockopt() could
always acquire the lock first but it may turn out to be too
pessimistic for the future bpf_[G]etsockopt() refactoring.

or we could do this 'bit' break up (into one is_locked bit
for locked and one is_bpf to skip-capable-check).  I was waiting until a real
need comes up instead of having both bits always true now.  I don't mind to
add is_locked now since the bpf_lsm_cgroup may come to sleepable soon.
I can do this in the next spin.
Martin KaFai Lau July 27, 2022, 8:05 p.m. UTC | #4
On Wed, Jul 27, 2022 at 08:36:18AM +0000, David Laight wrote:
> From: Martin KaFai Lau
> > Sent: 27 July 2022 07:09
> > 
> > Most of the codes in bpf_setsockopt(SOL_SOCKET) are duplicated from
> > the sock_setsockopt().  The number of supported options are
> > increasing ever and so as the duplicated codes.
> > 
> > One issue in reusing sock_setsockopt() is that the bpf prog
> > has already acquired the sk lock.  sockptr_t is useful to handle this.
> > sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> > memory copy.  This patch adds a 'is_bpf' bit to tell if sk locking
> > has already been ensured by the bpf prog.
> 
> That is a really horrid place to hide an 'is locked' bit.
> 
> You'd be better off splitting sock_setsockopt() to add a function
> that is called with sk_lock held and the value read.
> That would also save the churn of all the callers.
There is no churn to the callers after this patch, so quite
the opposite.
Stanislav Fomichev July 27, 2022, 8:39 p.m. UTC | #5
On Wed, Jul 27, 2022 at 11:37 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Jul 27, 2022 at 09:47:25AM -0700, sdf@google.com wrote:
> > On 07/26, Martin KaFai Lau wrote:
> > > Most of the codes in bpf_setsockopt(SOL_SOCKET) are duplicated from
> > > the sock_setsockopt().  The number of supported options are
> > > increasing ever and so as the duplicated codes.
> >
> > > One issue in reusing sock_setsockopt() is that the bpf prog
> > > has already acquired the sk lock.  sockptr_t is useful to handle this.
> > > sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> > > memory copy.  This patch adds a 'is_bpf' bit to tell if sk locking
> > > has already been ensured by the bpf prog.
> >
> > Why not explicitly call it is_locked/is_unlocked? I'm assuming, at some
> > point,
> is_locked was my initial attempt.  The bpf_setsockopt() also skips
> the ns_capable() check, like in patch 3.  I ended up using
> one is_bpf bit here to do both.

Yeah, sorry, I haven't read the whole series before I sent my first
reply. Let's discuss it here.

This reminds me of ns_capable in __inet_bind where we also had to add
special handling.

In general, not specific to the series, I wonder if we want some new
in_bpf() context indication and bypass ns_capable() from those
contexts?
Then we can do things like:

  if (sk->sk_bound_dev_if && !in_bpf() && !ns_capable(net->user_ns,
CAP_NET_RAW))
    return ...;

Or would it make things more confusing?



> > we can have code paths in bpf where the socket has been already locked by
> > the stack?
> hmm... You meant the opposite, like the bpf hook does not have the
> lock pre-acquired before the bpf prog gets run and sock_setsockopt()
> should do lock_sock() as usual?
>
> I was thinking a likely situation is a bpf 'sleepable' hook does not
> have the lock pre-acquired.  In that case, the bpf_setsockopt() could
> always acquire the lock first but it may turn out to be too
> pessmissitic for the future bpf_[G]etsockopt() refactoring.
>
> or we could do this 'bit' break up (into one is_locked bit
> for locked and one is_bpf to skip-capable-check).  I was waiting until a real
> need comes up instead of having both bits always true now.  I don't mind to
> add is_locked now since the bpf_lsm_cgroup may come to sleepable soon.
> I can do this in the next spin.
Martin KaFai Lau July 27, 2022, 9:21 p.m. UTC | #6
On Wed, Jul 27, 2022 at 01:39:08PM -0700, Stanislav Fomichev wrote:
> On Wed, Jul 27, 2022 at 11:37 AM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Wed, Jul 27, 2022 at 09:47:25AM -0700, sdf@google.com wrote:
> > > On 07/26, Martin KaFai Lau wrote:
> > > > Most of the codes in bpf_setsockopt(SOL_SOCKET) are duplicated from
> > > > the sock_setsockopt().  The number of supported options are
> > > > increasing ever and so as the duplicated codes.
> > >
> > > > One issue in reusing sock_setsockopt() is that the bpf prog
> > > > has already acquired the sk lock.  sockptr_t is useful to handle this.
> > > > sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> > > > memory copy.  This patch adds a 'is_bpf' bit to tell if sk locking
> > > > has already been ensured by the bpf prog.
> > >
> > > Why not explicitly call it is_locked/is_unlocked? I'm assuming, at some
> > > point,
> > is_locked was my initial attempt.  The bpf_setsockopt() also skips
> > the ns_capable() check, like in patch 3.  I ended up using
> > one is_bpf bit here to do both.
> 
> Yeah, sorry, I haven't read the whole series before I sent my first
> reply. Let's discuss it here.
> 
> This reminds me of ns_capable in __inet_bind where we also had to add
> special handling.
> 
> In general, not specific to the series, I wonder if we want some new
> in_bpf() context indication and bypass ns_capable() from those
> contexts?
> Then we can do things like:
> 
>   if (sk->sk_bound_dev_if && !in_bpf() && !ns_capable(net->user_ns,
> CAP_NET_RAW))
>     return ...;
Don't see a way to implement in_bpf() after some thoughts.
Do you have idea ?

> 
> Or would it make things more confusing?
> 
> 
> 
> > > we can have code paths in bpf where the socket has been already locked by
> > > the stack?
> > hmm... You meant the opposite, like the bpf hook does not have the
> > lock pre-acquired before the bpf prog gets run and sock_setsockopt()
> > should do lock_sock() as usual?
> >
> > I was thinking a likely situation is a bpf 'sleepable' hook does not
> > have the lock pre-acquired.  In that case, the bpf_setsockopt() could
> > always acquire the lock first but it may turn out to be too
> > pessmissitic for the future bpf_[G]etsockopt() refactoring.
> >
> > or we could do this 'bit' break up (into one is_locked bit
> > for locked and one is_bpf to skip-capable-check).  I was waiting until a real
> > need comes up instead of having both bits always true now.  I don't mind to
> > add is_locked now since the bpf_lsm_cgroup may come to sleepable soon.
> > I can do this in the next spin.
Stanislav Fomichev July 27, 2022, 9:38 p.m. UTC | #7
On Wed, Jul 27, 2022 at 2:21 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Wed, Jul 27, 2022 at 01:39:08PM -0700, Stanislav Fomichev wrote:
> > On Wed, Jul 27, 2022 at 11:37 AM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Wed, Jul 27, 2022 at 09:47:25AM -0700, sdf@google.com wrote:
> > > > On 07/26, Martin KaFai Lau wrote:
> > > > > Most of the codes in bpf_setsockopt(SOL_SOCKET) are duplicated from
> > > > > the sock_setsockopt().  The number of supported options are
> > > > > increasing ever and so as the duplicated codes.
> > > >
> > > > > One issue in reusing sock_setsockopt() is that the bpf prog
> > > > > has already acquired the sk lock.  sockptr_t is useful to handle this.
> > > > > sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> > > > > memory copy.  This patch adds a 'is_bpf' bit to tell if sk locking
> > > > > has already been ensured by the bpf prog.
> > > >
> > > > Why not explicitly call it is_locked/is_unlocked? I'm assuming, at some
> > > > point,
> > > is_locked was my initial attempt.  The bpf_setsockopt() also skips
> > > the ns_capable() check, like in patch 3.  I ended up using
> > > one is_bpf bit here to do both.
> >
> > Yeah, sorry, I haven't read the whole series before I sent my first
> > reply. Let's discuss it here.
> >
> > This reminds me of ns_capable in __inet_bind where we also had to add
> > special handling.
> >
> > In general, not specific to the series, I wonder if we want some new
> > in_bpf() context indication and bypass ns_capable() from those
> > contexts?
> > Then we can do things like:
> >
> >   if (sk->sk_bound_dev_if && !in_bpf() && !ns_capable(net->user_ns,
> > CAP_NET_RAW))
> >     return ...;
> Don't see a way to implement in_bpf() after some thoughts.
> Do you have idea ?

I wonder if we can cheat a bit with the following:

bool setsockopt_capable(struct user_namespace *ns, int cap)
{
       if (!in_task()) {
             /* Running in irq/softirq -> setsockopt invoked by bpf program.
              * [not sure, is it safe to assume no regular path leads
to setsockopt from sirq?]
              */
             return true;
       }

       /* Running in process context, task has bpf_ctx set -> invoked
by bpf program. */
       if (current->bpf_ctx != NULL)
             return true;

       return ns_capable(ns, cap);
}

And then do /ns_capable/setsockopt_capable/ in net/core/sock.c

But that might be more fragile than passing the flag, idk.
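
[Editorial note: a sketch of what that substitution could look like at one
capability-gated call site inside sock_setsockopt()'s switch.  Illustrative
only; SO_MARK is just an example, and setsockopt_capable() is the
hypothetical helper proposed above.]

	case SO_MARK:
		if (!setsockopt_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
		    !setsockopt_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
			ret = -EPERM;
			break;
		}
		/* ... set sk->sk_mark as the existing code does ... */
		break;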

> > Or would it make things more confusing?
> >
> >
> >
> > > > we can have code paths in bpf where the socket has been already locked by
> > > > the stack?
> > > hmm... You meant the opposite, like the bpf hook does not have the
> > > lock pre-acquired before the bpf prog gets run and sock_setsockopt()
> > > should do lock_sock() as usual?
> > >
> > > I was thinking a likely situation is a bpf 'sleepable' hook does not
> > > have the lock pre-acquired.  In that case, the bpf_setsockopt() could
> > > always acquire the lock first but it may turn out to be too
> > > pessmissitic for the future bpf_[G]etsockopt() refactoring.
> > >
> > > or we could do this 'bit' break up (into one is_locked bit
> > > for locked and one is_bpf to skip-capable-check).  I was waiting until a real
> > > need comes up instead of having both bits always true now.  I don't mind to
> > > add is_locked now since the bpf_lsm_cgroup may come to sleepable soon.
> > > I can do this in the next spin.
Martin KaFai Lau July 28, 2022, 12:45 a.m. UTC | #8
On Wed, Jul 27, 2022 at 02:38:51PM -0700, Stanislav Fomichev wrote:
> On Wed, Jul 27, 2022 at 2:21 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Wed, Jul 27, 2022 at 01:39:08PM -0700, Stanislav Fomichev wrote:
> > > On Wed, Jul 27, 2022 at 11:37 AM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > On Wed, Jul 27, 2022 at 09:47:25AM -0700, sdf@google.com wrote:
> > > > > On 07/26, Martin KaFai Lau wrote:
> > > > > > Most of the codes in bpf_setsockopt(SOL_SOCKET) are duplicated from
> > > > > > the sock_setsockopt().  The number of supported options are
> > > > > > increasing ever and so as the duplicated codes.
> > > > >
> > > > > > One issue in reusing sock_setsockopt() is that the bpf prog
> > > > > > has already acquired the sk lock.  sockptr_t is useful to handle this.
> > > > > > sockptr_t already has a bit 'is_kernel' to handle the kernel-or-user
> > > > > > memory copy.  This patch adds a 'is_bpf' bit to tell if sk locking
> > > > > > has already been ensured by the bpf prog.
> > > > >
> > > > > Why not explicitly call it is_locked/is_unlocked? I'm assuming, at some
> > > > > point,
> > > > is_locked was my initial attempt.  The bpf_setsockopt() also skips
> > > > the ns_capable() check, like in patch 3.  I ended up using
> > > > one is_bpf bit here to do both.
> > >
> > > Yeah, sorry, I haven't read the whole series before I sent my first
> > > reply. Let's discuss it here.
> > >
> > > This reminds me of ns_capable in __inet_bind where we also had to add
> > > special handling.
> > >
> > > In general, not specific to the series, I wonder if we want some new
> > > in_bpf() context indication and bypass ns_capable() from those
> > > contexts?
> > > Then we can do things like:
> > >
> > >   if (sk->sk_bound_dev_if && !in_bpf() && !ns_capable(net->user_ns,
> > > CAP_NET_RAW))
> > >     return ...;
> > Don't see a way to implement in_bpf() after some thoughts.
> > Do you have idea ?
> 
> I wonder if we can cheat a bit with the following:
> 
> bool setsockopt_capable(struct user_namespace *ns, int cap)
> {
>        if (!in_task()) {
>              /* Running in irq/softirq -> setsockopt invoked by bpf program.
>               * [not sure, is it safe to assume no regular path leads
> to setsockopt from sirq?]
>               */
>              return true;
>        }
> 
>        /* Running in process context, task has bpf_ctx set -> invoked
> by bpf program. */
>        if (current->bpf_ctx != NULL)
>              return true;
> 
>        return ns_capable(ns, cap);
> }
> 
> And then do /ns_capable/setsockopt_capable/ in net/core/sock.c
> 
> But that might be more fragile than passing the flag, idk.
I think it should work.  From a quick look, all bpf_setsockopt usage has
bpf_ctx.  The one from bpf_tcp_ca (struct_ops) and bpf_iter is trampoline
which also has bpf_ctx.  Not sure about the future use cases.

To be honest, I am not sure if I have missed cases, and I have questions
similar to yours about the above sample code.  This may deserve a separate patch
set for discussion.  Using a bit in sockptr is mostly free now.
WDYT ?

> 
> > > Or would it make things more confusing?
> > >
> > >
> > >
> > > > > we can have code paths in bpf where the socket has been already locked by
> > > > > the stack?
> > > > hmm... You meant the opposite, like the bpf hook does not have the
> > > > lock pre-acquired before the bpf prog gets run and sock_setsockopt()
> > > > should do lock_sock() as usual?
> > > >
> > > > I was thinking a likely situation is a bpf 'sleepable' hook does not
> > > > have the lock pre-acquired.  In that case, the bpf_setsockopt() could
> > > > always acquire the lock first but it may turn out to be too
> > > > pessmissitic for the future bpf_[G]etsockopt() refactoring.
> > > >
> > > > or we could do this 'bit' break up (into one is_locked bit
> > > > for locked and one is_bpf to skip-capable-check).  I was waiting until a real
> > > > need comes up instead of having both bits always true now.  I don't mind to
> > > > add is_locked now since the bpf_lsm_cgroup may come to sleepable soon.
> > > > I can do this in the next spin.
Jakub Kicinski July 28, 2022, 1:49 a.m. UTC | #9
On Wed, 27 Jul 2022 17:45:46 -0700 Martin KaFai Lau wrote:
> > bool setsockopt_capable(struct user_namespace *ns, int cap)
> > {
> >        if (!in_task()) {
> >              /* Running in irq/softirq -> setsockopt invoked by bpf program.
> >               * [not sure, is it safe to assume no regular path leads
> > to setsockopt from sirq?]
> >               */
> >              return true;
> >        }
> > 
> >        /* Running in process context, task has bpf_ctx set -> invoked
> > by bpf program. */
> >        if (current->bpf_ctx != NULL)
> >              return true;
> > 
> >        return ns_capable(ns, cap);
> > }
> > 
> > And then do /ns_capable/setsockopt_capable/ in net/core/sock.c
> > 
> > But that might be more fragile than passing the flag, idk.  
> I think it should work.  From a quick look, all bpf_setsockopt usage has
> bpf_ctx.  The one from bpf_tcp_ca (struct_ops) and bpf_iter is trampoline
> which also has bpf_ctx.  Not sure about the future use cases.
> 
> To be honest, I am not sure if I have missed cases and also have similar questions
> your have in the above sample code.  This may deserve a separate patch
> set for discussion.  Using a bit in sockptr is mostly free now.
> WDYT ?

Sorry to chime in but I vote against @in_bpf. I had to search the git
history recently to figure out what SK_USER_DATA_BPF means. It's not
going to be obvious to a networking person what semantics to attribute
to "in bpf".
Martin KaFai Lau July 28, 2022, 4:31 p.m. UTC | #10
On Wed, Jul 27, 2022 at 06:49:03PM -0700, Jakub Kicinski wrote:
> On Wed, 27 Jul 2022 17:45:46 -0700 Martin KaFai Lau wrote:
> > > bool setsockopt_capable(struct user_namespace *ns, int cap)
> > > {
> > >        if (!in_task()) {
> > >              /* Running in irq/softirq -> setsockopt invoked by bpf program.
> > >               * [not sure, is it safe to assume no regular path leads
> > > to setsockopt from sirq?]
> > >               */
> > >              return true;
> > >        }
> > > 
> > >        /* Running in process context, task has bpf_ctx set -> invoked
> > > by bpf program. */
> > >        if (current->bpf_ctx != NULL)
> > >              return true;
> > > 
> > >        return ns_capable(ns, cap);
> > > }
> > > 
> > > And then do /ns_capable/setsockopt_capable/ in net/core/sock.c
> > > 
> > > But that might be more fragile than passing the flag, idk.  
> > I think it should work.  From a quick look, all bpf_setsockopt usage has
> > bpf_ctx.  The one from bpf_tcp_ca (struct_ops) and bpf_iter is trampoline
> > which also has bpf_ctx.  Not sure about the future use cases.
> > 
> > To be honest, I am not sure if I have missed cases and also have similar questions
> > your have in the above sample code.  This may deserve a separate patch
> > set for discussion.  Using a bit in sockptr is mostly free now.
> > WDYT ?
> 
> Sorry to chime in but I vote against @in_bpf. I had to search the git
> history recently to figure out what SK_USER_DATA_BPF means. It's not
> going to be obvious to a networking person what semantics to attribute
> to "in bpf".
If I understand the concern correctly, it may not be straightforward to
grasp the reason behind the tests in in_bpf() [ the in_task() and
the current->bpf_ctx test ]?  Yes, it is a valid point.

The optval.is_bpf bit can be directly traced back to the bpf_setsockopt
helper and should be easier to reason about.
Jakub Kicinski July 28, 2022, 4:56 p.m. UTC | #11
On Thu, 28 Jul 2022 09:31:04 -0700 Martin KaFai Lau wrote:
> If I understand the concern correctly, it may not be straight forward to
> grip the reason behind the testings at in_bpf() [ the in_task() and
> the current->bpf_ctx test ] ?  Yes, it is a valid point.
> 
> The optval.is_bpf bit can be directly traced back to the bpf_setsockopt
> helper and should be easier to reason about.

I think we're saying the opposite thing. in_bpf() the context checking
function is fine. There is a clear parallel to in_task() and combined
with the capability check it should be pretty obvious what the code
is intending to achieve.

sockptr_t::in_bpf which randomly implies that the lock is already held
will be hard to understand for anyone not intimately familiar with the
BPF code. Naming that bit is_locked seems much clearer.

Which is what I believe Stan was proposing.
Martin KaFai Lau July 28, 2022, 5:20 p.m. UTC | #12
On Thu, Jul 28, 2022 at 09:56:29AM -0700, Jakub Kicinski wrote:
> On Thu, 28 Jul 2022 09:31:04 -0700 Martin KaFai Lau wrote:
> > If I understand the concern correctly, it may not be straight forward to
> > grip the reason behind the testings at in_bpf() [ the in_task() and
> > the current->bpf_ctx test ] ?  Yes, it is a valid point.
> > 
> > The optval.is_bpf bit can be directly traced back to the bpf_setsockopt
> > helper and should be easier to reason about.
> 
> I think we're saying the opposite thing. in_bpf() the context checking
> function is fine. There is a clear parallel to in_task() and combined
> with the capability check it should be pretty obvious what the code
> is intending to achieve.
> 
> sockptr_t::in_bpf which randomly implies that the lock is already held
> will be hard to understand for anyone not intimately familiar with the
> BPF code. Naming that bit is_locked seems much clearer.
> 
> Which is what I believe Stan was proposing.
Yeah, I think I read the 'vote against @in_bpf' in the other way. :)

Sure. I will do s/is_bpf/is_locked/ and do the in_bpf() context
checking before ns_capable().
Jakub Kicinski July 28, 2022, 5:40 p.m. UTC | #13
On Thu, 28 Jul 2022 10:20:04 -0700 Martin KaFai Lau wrote:
> > Which is what I believe Stan was proposing.  
> Yeah, I think I read the 'vote against @in_bpf' in the other way. :)

My bad, I didn't read the proposal in sufficient detail to realize 
the helper is called the same thing as the bit :D
David Laight July 29, 2022, 10:04 a.m. UTC | #14
From: Jakub Kicinski
> Sent: 28 July 2022 17:56
> 
> On Thu, 28 Jul 2022 09:31:04 -0700 Martin KaFai Lau wrote:
> > If I understand the concern correctly, it may not be straight forward to
> > grip the reason behind the testings at in_bpf() [ the in_task() and
> > the current->bpf_ctx test ] ?  Yes, it is a valid point.
> >
> > The optval.is_bpf bit can be directly traced back to the bpf_setsockopt
> > helper and should be easier to reason about.
> 
> I think we're saying the opposite thing. in_bpf() the context checking
> function is fine. There is a clear parallel to in_task() and combined
> with the capability check it should be pretty obvious what the code
> is intending to achieve.
> 
> sockptr_t::in_bpf which randomly implies that the lock is already held
> will be hard to understand for anyone not intimately familiar with the
> BPF code. Naming that bit is_locked seems much clearer.
> 
> Which is what I believe Stan was proposing.

Or make sk_setsockopt() be called after the integer value
has been read and with the lock held.

That saves any (horrid) conditional locking.

Also sockptr_t should probably have been a structure with separate
user and kernel address fields.
Putting the length in there would (probably) save code.

There then might be scope for pre-copying short user buffers
into a kernel buffer while still allowing the requests that
ignore the length copy directly from a user buffer.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
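
[Editorial note: a rough sketch of the layout David describes, hypothetical
and not something in the tree: separate user and kernel pointers plus an
explicit length, instead of the current anonymous-union-plus-is_kernel
sockptr_t.]

/* Hypothetical alternative layout, for illustration only. */
typedef struct {
	void		*kernel;	/* set for kernel buffers */
	void __user	*user;		/* set for user buffers */
	u32		len;		/* optlen carried with the pointer */
} sockbuf_t;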
Martin KaFai Lau July 29, 2022, 7:06 p.m. UTC | #15
On Fri, Jul 29, 2022 at 10:04:29AM +0000, David Laight wrote:
> From: Jakub Kicinski
> > Sent: 28 July 2022 17:56
> > 
> > On Thu, 28 Jul 2022 09:31:04 -0700 Martin KaFai Lau wrote:
> > > If I understand the concern correctly, it may not be straight forward to
> > > grip the reason behind the testings at in_bpf() [ the in_task() and
> > > the current->bpf_ctx test ] ?  Yes, it is a valid point.
> > >
> > > The optval.is_bpf bit can be directly traced back to the bpf_setsockopt
> > > helper and should be easier to reason about.
> > 
> > I think we're saying the opposite thing. in_bpf() the context checking
> > function is fine. There is a clear parallel to in_task() and combined
> > with the capability check it should be pretty obvious what the code
> > is intending to achieve.
> > 
> > sockptr_t::in_bpf which randomly implies that the lock is already held
> > will be hard to understand for anyone not intimately familiar with the
> > BPF code. Naming that bit is_locked seems much clearer.
> > 
> > Which is what I believe Stan was proposing.
> 
> Or make sk_setsockopt() be called after the integer value
> has been read and with the lock held.
> 
> That saves any (horrid) conditional locking.
> 
> Also sockptr_t should probably have been a structure with separate
> user and kernel address fields.
> Putting the length in there would (probably) save code.
> 
> There then might be scope for pre-copying short user buffers
> into a kernel buffer while still allowing the requests that
> ignore the length copy directly from a user buffer.
Some optnames take their own lock, e.g. some in do_tcp_setsockopt().
Those will need to be broken down into their own locked and unlocked functions.
This applies not only to setsockopt but also to the future [g]etsockopt()
refactoring, where most optnames are not under one lock_sock() and each
optname could take or release the lock on its own.

imo, this is unnecessary code churn for long switch statements like
setsockopt, without a clear benefit.  While this patch is not the first
conditional locking in the kernel, I would like to hear what others think
about doing the conditional locking in a helper like lock_sock_sockopt()
for set/getsockopt().

With the in_bpf() helper suggested by Stan, is_locked could be passed
as an additional argument instead.  Then there is no need to change
sockptr_t, and sockptr_t is left to contain only the optval itself.
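
[Editorial note: a sketch of the alternative being described, with a
hypothetical signature: sockptr_t keeps carrying only the optval, the lock
state becomes an explicit argument, and the capability bypass would come
from an in_bpf()-style context check rather than a sockptr_t bit.]

/* Hypothetical; not what this patch implements. */
int sock_setsockopt(struct sock *sk, int level, int optname,
		    sockptr_t optval, unsigned int optlen, bool is_locked);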

Patch

diff --git a/include/linux/sockptr.h b/include/linux/sockptr.h
index d45902fb4cad..787d0be08fb7 100644
--- a/include/linux/sockptr.h
+++ b/include/linux/sockptr.h
@@ -16,7 +16,8 @@  typedef struct {
 		void		*kernel;
 		void __user	*user;
 	};
-	bool		is_kernel : 1;
+	bool		is_kernel : 1,
+			is_bpf : 1;
 } sockptr_t;
 
 static inline bool sockptr_is_kernel(sockptr_t sockptr)
@@ -24,6 +25,11 @@  static inline bool sockptr_is_kernel(sockptr_t sockptr)
 	return sockptr.is_kernel;
 }
 
+static inline sockptr_t KERNEL_SOCKPTR_BPF(void *p)
+{
+	return (sockptr_t) { .kernel = p, .is_kernel = true, .is_bpf = true };
+}
+
 static inline sockptr_t KERNEL_SOCKPTR(void *p)
 {
 	return (sockptr_t) { .kernel = p, .is_kernel = true };
diff --git a/include/net/sock.h b/include/net/sock.h
index 9e2539dcc293..2f913bdf0035 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1721,6 +1721,18 @@  static inline void unlock_sock_fast(struct sock *sk, bool slow)
 	}
 }
 
+static inline void lock_sock_sockopt(struct sock *sk, sockptr_t optval)
+{
+	if (!optval.is_bpf)
+		lock_sock(sk);
+}
+
+static inline void release_sock_sockopt(struct sock *sk, sockptr_t optval)
+{
+	if (!optval.is_bpf)
+		release_sock(sk);
+}
+
 /* Used by processes to "lock" a socket state, so that
  * interrupts and bottom half handlers won't change it
  * from under us. It essentially blocks any incoming
diff --git a/net/core/sock.c b/net/core/sock.c
index 18bb4f269cf1..61d927a5f6cb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -703,7 +703,9 @@  static int sock_setbindtodevice(struct sock *sk, sockptr_t optval, int optlen)
 			goto out;
 	}
 
-	return sock_bindtoindex(sk, index, true);
+	lock_sock_sockopt(sk, optval);
+	ret = sock_bindtoindex_locked(sk, index);
+	release_sock_sockopt(sk, optval);
 out:
 #endif
 
@@ -1067,7 +1069,7 @@  int sock_setsockopt(struct sock *sk, int level, int optname,
 
 	valbool = val ? 1 : 0;
 
-	lock_sock(sk);
+	lock_sock_sockopt(sk, optval);
 
 	switch (optname) {
 	case SO_DEBUG:
@@ -1496,7 +1498,7 @@  int sock_setsockopt(struct sock *sk, int level, int optname,
 		ret = -ENOPROTOOPT;
 		break;
 	}
-	release_sock(sk);
+	release_sock_sockopt(sk, optval);
 	return ret;
 }
 EXPORT_SYMBOL(sock_setsockopt);