mbox series

[GIT,PULL] SELinux patches for v5.8

Message ID CAHC9VhTX8gkUui6AiTJMJgcohXa=TOqdO==rEDk=Mquz9sCNKA@mail.gmail.com (mailing list archive)
State Accepted
Headers show
Series [GIT,PULL] SELinux patches for v5.8 | expand

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git tags/selinux-pr-20200601

Message

Paul Moore June 2, 2020, 1:06 a.m. UTC
Hi Linus,

Here are the SELinux patches for v5.8.  All the patches pass our test
suite and as of about 15 minutes ago they apply cleanly to the top of
your tree.  Please merge into v5.8.

The highlights:

- A number of improvements to various SELinux internal data structures
to help improve performance.  We move the role transitions into a hash
table.  In the content structure we shift from hashing the content
string (aka SELinux label) to the structure itself, when it is valid.
This last change not only offers a speedup, but it helps us simplify
the code some as well.

- Add a new SELinux policy version which allows for a more space
efficient way of storing the filename transitions in the binary
policy.  Given the default Fedora SELinux policy with the unconfined
module enabled, this change drops the policy size from ~7.6MB to
~3.3MB.  The kernel policy load time dropped as well.

- Some fixes to the error handling code in the policy parser to
properly return error codes when things go wrong.

Thanks,
-Paul

--
The following changes since commit 8f3d9f354286745c751374f5f1fcafee6b3f3136:

 Linux 5.7-rc1 (2020-04-12 12:35:55 -0700)

are available in the Git repository at:

 git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git
   tags/selinux-pr-20200601

for you to fetch changes up to fe5a90b8c14914397a3bb0c214d142103c1ba3bf:

 selinux: netlabel: Remove unused inline function (2020-05-12 20:16:33 -0400)

----------------------------------------------------------------
selinux/stable-5.8 PR 20200601

----------------------------------------------------------------
Ondrej Mosnacek (9):
     selinux: drop unnecessary smp_load_acquire() call
     selinux: store role transitions in a hash table
     selinux: hash context structure directly
     selinux: move context hashing under sidtab
     selinux: implement new format of filename transitions
     selinux: don't produce incorrect filename_trans_count
     selinux: simplify range_write()
     selinux: fix return value on error in policydb_read()
     selinux: do not allocate hashtabs dynamically

Wei Yongjun (1):
     selinux: fix error return code in policydb_read()

YueHaibing (1):
     selinux: netlabel: Remove unused inline function

Zou Wei (1):
     selinux: fix warning Comparison to bool

security/selinux/Makefile           |   2 +-
security/selinux/include/netlabel.h |   6 -
security/selinux/include/security.h |   3 +-
security/selinux/ss/context.c       |  32 +++
security/selinux/ss/context.h       |  11 +-
security/selinux/ss/ebitmap.c       |  14 ++
security/selinux/ss/ebitmap.h       |   1 +
security/selinux/ss/hashtab.c       |  51 ++--
security/selinux/ss/hashtab.h       |  13 +-
security/selinux/ss/mls.c           |  16 +-
security/selinux/ss/mls.h           |  11 +
security/selinux/ss/policydb.c      | 451 +++++++++++++++++++++++---------
security/selinux/ss/policydb.h      |  15 +-
security/selinux/ss/services.c      | 153 +++++-------
security/selinux/ss/services.h      |   3 -
security/selinux/ss/sidtab.c        |  35 +--
security/selinux/ss/sidtab.h        |   1 +
security/selinux/ss/symtab.c        |   5 +-
security/selinux/ss/symtab.h        |   2 +-
19 files changed, 499 insertions(+), 326 deletions(-)
create mode 100644 security/selinux/ss/context.c

Comments

Linus Torvalds June 3, 2020, 12:31 a.m. UTC | #1
On Mon, Jun 1, 2020 at 6:07 PM Paul Moore <paul@paul-moore.com> wrote:
>
> - A number of improvements to various SELinux internal data structures
> to help improve performance.  We move the role transitions into a hash
> table.  In the content structure we shift from hashing the content
> string (aka SELinux label) to the structure itself, when it is valid.
> This last change not only offers a speedup, but it helps us simplify
> the code some as well.

Side note since you mention performance work: in the past when I've
looked at SELinux performance (generally as part of pathname lookup
etc VFS loads), the biggest cost by far was that all the SELinux data
structures take a ton of cache misses.

Yes, some of the hashing shows up in the profiles, but _most_ of it
was loading the data from inode->i_security etc.

And the reason seemed to be that every single inode ends up having a
separately allocated "struct inode_security_struct" (aka "isec"). Even
if the contents are often all exactly the same for a large set of
inodes that thus _could_ conceptually share the data.

Now, it used to be - before being able to stack security layers -
SElinux would control that pointer, and it could have done some kind
of sharing scheme with copy-on-write behavior (the way we do 'struct
cred' for processes), and it would have caused a much smaller cache
footprint (and thus likely much fewer cache misses).

These days, that sharing of the i_security pointer across different
security layers makes that sound really really painful.

But I do wonder if anybody in selinux land (or general security
subsystem land) has been thinking of maybe at least having a "this
inode has no special labeling" marker that could possibly avoid having
all those extra allocations.

Because it really does end up being visible in profiles how costly it
is to look up any data behind inode->i_security.

               Linus
pr-tracker-bot@kernel.org June 3, 2020, 1:10 a.m. UTC | #2
The pull request you sent on Mon, 1 Jun 2020 21:06:48 -0400:

> git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git tags/selinux-pr-20200601

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/f41030a20b38552a2da3b3f6bc9e7a78637d6c23

Thank you!
Casey Schaufler June 3, 2020, 5:20 p.m. UTC | #3
On 6/2/2020 5:31 PM, Linus Torvalds wrote:
> On Mon, Jun 1, 2020 at 6:07 PM Paul Moore <paul@paul-moore.com> wrote:
>> - A number of improvements to various SELinux internal data structures
>> to help improve performance.  We move the role transitions into a hash
>> table.  In the content structure we shift from hashing the content
>> string (aka SELinux label) to the structure itself, when it is valid.
>> This last change not only offers a speedup, but it helps us simplify
>> the code some as well.
> Side note since you mention performance work: in the past when I've
> looked at SELinux performance (generally as part of pathname lookup
> etc VFS loads), the biggest cost by far was that all the SELinux data
> structures take a ton of cache misses.
>
> Yes, some of the hashing shows up in the profiles, but _most_ of it
> was loading the data from inode->i_security etc.

The whole security blob scheme is based on the 20th century notion
that security beyond user identification lives in the realm of the
lunatic fringe. The use of security modules was expected to be rare.
Clearly we've moved on from that. A system without a security module
is the exception, not the rule.

> And the reason seemed to be that every single inode ends up having a
> separately allocated "struct inode_security_struct" (aka "isec"). Even
> if the contents are often all exactly the same for a large set of
> inodes that thus _could_ conceptually share the data.

There's state information as well as the access control attributes
in the SELinux and Smack blobs.

> Now, it used to be - before being able to stack security layers -
> SElinux would control that pointer, and it could have done some kind
> of sharing scheme with copy-on-write behavior (the way we do 'struct
> cred' for processes), and it would have caused a much smaller cache
> footprint (and thus likely much fewer cache misses).
>
> These days, that sharing of the i_security pointer across different
> security layers makes that sound really really painful.

Dealing with shared creds is reasonably painful, too.

> But I do wonder if anybody in selinux land (or general security
> subsystem land) has been thinking of maybe at least having a "this
> inode has no special labeling" marker that could possibly avoid having
> all those extra allocations.

The fine granularity of SELinux policy isn't conducive to that.
There's also the state information to deal with. 

> Because it really does end up being visible in profiles how costly it
> is to look up any data behind inode->i_security.

We could have inode->i_security be the blob, rather than a pointer to it.
That will have its own performance issues. I proposed doing just that in
2013, to the resounding sound of crickets.

I am painfully aware of the performance considerations that influence
the way we've had to implement security modules. But these
considerations have always emphasized the performance when security
modules are *not* being used, not when they are. I am encouraged to see
that being questioned. I believe that we can do better, but that some
of the constraints we've had to work under need to come off before we
can do a good job of it.

>                Linus
Linus Torvalds June 3, 2020, 5:37 p.m. UTC | #4
On Wed, Jun 3, 2020 at 10:20 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> We could have inode->i_security be the blob, rather than a pointer to it.
> That will have its own performance issues.

It wouldn't actually really fix anything, because the inode is so big
and sparsely accessed that it doesn't even really help the cache
density issue. Yeah, it gets rid of the pointer access, but that's
pretty much it. The fact that we randomize the order means that we
can't even really try to aim for any cache density.

And it would actually not be possible with the current layered
security model anyway, since those blob sizes are dynamic at runtime.

If we had _only_ SELinux, we could perhaps have hidden the
sid/sclass/task_sid directly in the inode (it would be only slightly
larger than the pointer is, anyway), but even that ship sailed long
long ago due to the whole "no security person can ever agree with
another one on fundamentals".

So don't try to blame the rest of the system design. This is on the
security people. We've been able to handle other layers fairly well
because they generally agree on fundamentals (although it can take
decades before they then end up merging their code - things like the
filesystem people standardizing on iomap and other core concepts). And
as mentioned, when there is agreed-upon security rules (ie "struct
cred") we've been able to spend the effort to architect it so that it
doesn't add unnecessary overheads.

             Linus
Casey Schaufler June 3, 2020, 6:05 p.m. UTC | #5
On 6/3/2020 10:37 AM, Linus Torvalds wrote:
> On Wed, Jun 3, 2020 at 10:20 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> We could have inode->i_security be the blob, rather than a pointer to it.
>> That will have its own performance issues.
> It wouldn't actually really fix anything, because the inode is so big
> and sparsely accessed that it doesn't even really help the cache
> density issue. Yeah, it gets rid of the pointer access, but that's
> pretty much it. The fact that we randomize the order means that we
> can't even really try to aim for any cache density.

Well, it was a thought.

> And it would actually not be possible with the current layered
> security model anyway, since those blob sizes are dynamic at runtime.

The model would have to change. The dynamic blob size is an artifact
of the model, not a driver.

> If we had _only_ SELinux, we could perhaps have hidden the
> sid/sclass/task_sid directly in the inode (it would be only slightly
> larger than the pointer is, anyway), but even that ship sailed long
> long ago due to the whole "no security person can ever agree with
> another one on fundamentals".

Not to mention that the security landscape keeps changing.

> So don't try to blame the rest of the system design.

That wasn't my intent. Apologies.

>  This is on the
> security people. We've been able to handle other layers fairly well
> because they generally agree on fundamentals (although it can take
> decades before they then end up merging their code - things like the
> filesystem people standardizing on iomap and other core concepts). And
> as mentioned, when there is agreed-upon security rules (ie "struct
> cred") we've been able to spend the effort to architect it so that it
> doesn't add unnecessary overheads.
>
>              Linus
Alexei Starovoitov June 3, 2020, 9:02 p.m. UTC | #6
On Wed, Jun 03, 2020 at 10:37:47AM -0700, Linus Torvalds wrote:
> On Wed, Jun 3, 2020 at 10:20 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >
> > We could have inode->i_security be the blob, rather than a pointer to it.
> > That will have its own performance issues.
> 
> It wouldn't actually really fix anything, because the inode is so big
> and sparsely accessed that it doesn't even really help the cache
> density issue. Yeah, it gets rid of the pointer access, but that's
> pretty much it. The fact that we randomize the order means that we
> can't even really try to aim for any cache density.
> 
> And it would actually not be possible with the current layered
> security model anyway, since those blob sizes are dynamic at runtime.
> 
> If we had _only_ SELinux, we could perhaps have hidden the
> sid/sclass/task_sid directly in the inode (it would be only slightly
> larger than the pointer is, anyway), but even that ship sailed long
> long ago due to the whole "no security person can ever agree with
> another one on fundamentals".

Also there is bpf_lsm now that we're going to run it in production, 
so performance is as important as ever.
Traditional lsm-s have per-lsm per-inode blob.
For bpf that doesn't work, since progs come and go at run-time and
independent from each other.
So we need per-program per-inode blob.
To maintain good performance we've proposed:
@@ -740,6 +741,10 @@  struct inode {
 	struct fsverity_info	*i_verity_info;
 #endif
 
+#ifdef CONFIG_BPF_SYSCALL
+	struct bpf_local_storage __rcu	*inode_bpf_storage;
+#endif

https://patchwork.ozlabs.org/project/netdev/patch/20200526163336.63653-3-kpsingh@chromium.org/

but got pushback, so we're going to use lsm style for now:
+static inline struct bpf_lsm_storage *bpf_inode(const struct inode *inode)
+{
+       if (unlikely(!inode->i_security))
+               return NULL;
+
+       return inode->i_security + bpf_lsm_blob_sizes.lbs_inode;
+}

which means extra kmalloc for every inode, extra pointer deref, global var
access, and additional math just to get to 'inode_bpf_storage' pointer.

We have similar pointer in 'struct sock' already:
#ifdef CONFIG_BPF_SYSCALL
        struct bpf_sk_storage __rcu     *sk_bpf_storage;
#endif
that is used by variety of networking bpf programs.
The commit 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage")
has benchmarking data for it:
  hash table with 8-byte key -> 152ns per bpf run 
  sk_bpf_storage -> 66ns per bpf run
Hashtable suppose to be O(1) with L1$ hit, but it turned out
to be noticeably slower than sk_bpf_storage.
We expect to see similar performance gains for inode_bpf_storage
vs hashtable approach that people use now.
Eventually we'll add task_bpf_storage as well.
Right now every other bpf tracing script is using pid as a key
in a separate hash table to store per-task data. For high frequency
events that adds up. task_bpf_storage will accelerate that.

Another way to look at it is shared inode->i_security across
different inodes won't work for us. We need something really
cheap like single 'inode_bpf_storage' pointer that is zero
most of the time and for few inodes bpf progs will keep their
scratch data in there.
For now lsm style bpf_inode() approach is ok-ish.
But we will come back when we collect perf numbers to justify
why direct pointer in the 'struct inode' is a win.
Casey Schaufler June 3, 2020, 10 p.m. UTC | #7
On 6/3/2020 2:02 PM, Alexei Starovoitov wrote:
> On Wed, Jun 03, 2020 at 10:37:47AM -0700, Linus Torvalds wrote:
>> On Wed, Jun 3, 2020 at 10:20 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>> We could have inode->i_security be the blob, rather than a pointer to it.
>>> That will have its own performance issues.
>> It wouldn't actually really fix anything, because the inode is so big
>> and sparsely accessed that it doesn't even really help the cache
>> density issue. Yeah, it gets rid of the pointer access, but that's
>> pretty much it. The fact that we randomize the order means that we
>> can't even really try to aim for any cache density.
>>
>> And it would actually not be possible with the current layered
>> security model anyway, since those blob sizes are dynamic at runtime.
>>
>> If we had _only_ SELinux, we could perhaps have hidden the
>> sid/sclass/task_sid directly in the inode (it would be only slightly
>> larger than the pointer is, anyway), but even that ship sailed long
>> long ago due to the whole "no security person can ever agree with
>> another one on fundamentals".
> Also there is bpf_lsm now that we're going to run it in production, 
> so performance is as important as ever.
> Traditional lsm-s have per-lsm per-inode blob.
> For bpf that doesn't work, since progs come and go at run-time and
> independent from each other.

The inode blob is for attributes associated with the filesystem object.
There are cred and task blobs for program information.
If you need separate per-task data you should put it there.

> So we need per-program per-inode blob.
> To maintain good performance we've proposed:
> @@ -740,6 +741,10 @@  struct inode {
>  	struct fsverity_info	*i_verity_info;
>  #endif
>  
> +#ifdef CONFIG_BPF_SYSCALL
> +	struct bpf_local_storage __rcu	*inode_bpf_storage;
> +#endif
>
> https://patchwork.ozlabs.org/project/netdev/patch/20200526163336.63653-3-kpsingh@chromium.org/
>
> but got pushback, so we're going to use lsm style for now:
> +static inline struct bpf_lsm_storage *bpf_inode(const struct inode *inode)
> +{
> +       if (unlikely(!inode->i_security))
> +               return NULL;
> +
> +       return inode->i_security + bpf_lsm_blob_sizes.lbs_inode;
> +}
>
> which means extra kmalloc for every inode, extra pointer deref, global var
> access, and additional math just to get to 'inode_bpf_storage' pointer.
>
> We have similar pointer in 'struct sock' already:
> #ifdef CONFIG_BPF_SYSCALL
>         struct bpf_sk_storage __rcu     *sk_bpf_storage;
> #endif
> that is used by variety of networking bpf programs.
> The commit 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage")
> has benchmarking data for it:
>   hash table with 8-byte key -> 152ns per bpf run 
>   sk_bpf_storage -> 66ns per bpf run
> Hashtable suppose to be O(1) with L1$ hit, but it turned out
> to be noticeably slower than sk_bpf_storage.
> We expect to see similar performance gains for inode_bpf_storage
> vs hashtable approach that people use now.
> Eventually we'll add task_bpf_storage as well.
> Right now every other bpf tracing script is using pid as a key
> in a separate hash table to store per-task data. For high frequency
> events that adds up. task_bpf_storage will accelerate that.

Why aren't you using a task blob? We have support for what
you need. 

> Another way to look at it is shared inode->i_security across
> different inodes won't work for us. We need something really
> cheap like single 'inode_bpf_storage' pointer that is zero
> most of the time and for few inodes bpf progs will keep their
> scratch data in there.
> For now lsm style bpf_inode() approach is ok-ish.
> But we will come back when we collect perf numbers to justify
> why direct pointer in the 'struct inode' is a win.

It would be really helpful if instead of "the regular mechanism
is too slow, so we abandoned it" we could do "the regular mechanism
was to slow, so we made it better". I appreciate that every bit of
performance matters. That's true for all of the security modules,
which is why I object to special case tricks to boost the performance
of one module at the expense of everything else.
James Morris June 3, 2020, 10:12 p.m. UTC | #8
On Wed, 3 Jun 2020, Casey Schaufler wrote:

> The use of security modules was expected to be rare.

This is not correct. Capabilities were ported to LSM and stacked from the 
beginning, and several major distros worked on LSM so they could ship 
their own security modules.
Casey Schaufler June 3, 2020, 10:38 p.m. UTC | #9
On 6/3/2020 3:12 PM, James Morris wrote:
> On Wed, 3 Jun 2020, Casey Schaufler wrote:
>
>> The use of security modules was expected to be rare.
> This is not correct. Capabilities were ported to LSM and stacked from the 
> beginning, and several major distros worked on LSM so they could ship 
> their own security modules.

Capabilities has always been a special case.
Until Android adopted SELinux the actual use of LSMs was rare.
Paul Moore June 3, 2020, 11:35 p.m. UTC | #10
On Tue, Jun 2, 2020 at 8:31 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Side note since you mention performance work: in the past when I've
> looked at SELinux performance (generally as part of pathname lookup
> etc VFS loads), the biggest cost by far was that all the SELinux data
> structures take a ton of cache misses.
>
> Yes, some of the hashing shows up in the profiles, but _most_ of it
> was loading the data from inode->i_security etc.
>
> And the reason seemed to be that every single inode ends up having a
> separately allocated "struct inode_security_struct" (aka "isec"). Even
> if the contents are often all exactly the same for a large set of
> inodes that thus _could_ conceptually share the data.
>
> Now, it used to be - before being able to stack security layers -
> SElinux would control that pointer, and it could have done some kind
> of sharing scheme with copy-on-write behavior (the way we do 'struct
> cred' for processes), and it would have caused a much smaller cache
> footprint (and thus likely much fewer cache misses).

I believe right about the time that Eric Paris was stepping away from
SELinux he was working on a patchset that basically did what you
describe: copy-on-write for the SELinux inode blobs (aka
inode_security_struct, aka isec, etc.).  Unfortunately I don't believe
that work was ever finished and the idea was lost many years ago in
the maintainer shuffle; I was trying to figure out this whole
"maintainer thing" and perhaps didn't push Eric to post those patches
as much as I should have.  Although it's a big academic now with the
LSM stacking work.

Most of my SELinux thoughts these days are around the correctness and
robustness of the code, making sure we are testing as much as possible
(related to the first point), and trying to catch changes in other
subsystems which cause us breakage.  Not the most glamorous stuff, but
it's important.  SELinux is lucky enough to have a few active kernel
developers, and thankfully a couple of them appear to be looking at
some of the performance issues.

> These days, that sharing of the i_security pointer across different
> security layers makes that sound really really painful.

Yeah.  It's pretty much impossible now to do copy-on-write with the
main security blobs due to the differing nature of the LSMs and the
single, shared allocation for each blob.  I suppose if you wanted to
attempt copy-on-write inside a LSM you could introduce another layer
of pointers/allocation, but I'm not sure how much of an improvement
that might be.

Perhaps a bit more thought will produce a "eureka!" moment, but I'm
not overly optimistic.

> But I do wonder if anybody in selinux land (or general security
> subsystem land) has been thinking of maybe at least having a "this
> inode has no special labeling" marker that could possibly avoid having
> all those extra allocations.

I don't want to get into the "security people can't agree on anything"
discussion, but I think for that to work all of the loaded LSMs would
need to agree that they don't need to stash anything in the inode (or
other object); which I think is pretty much impossible most of the
time.  At least in the SELinux case, even if we were doing some sort
of copy-on-write, we would need to keep a reference back to the inode
security blob that does contain our needed info.
Alexei Starovoitov June 3, 2020, 11:38 p.m. UTC | #11
On Wed, Jun 03, 2020 at 03:00:05PM -0700, Casey Schaufler wrote:
> On 6/3/2020 2:02 PM, Alexei Starovoitov wrote:
> > On Wed, Jun 03, 2020 at 10:37:47AM -0700, Linus Torvalds wrote:
> >> On Wed, Jun 3, 2020 at 10:20 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>> We could have inode->i_security be the blob, rather than a pointer to it.
> >>> That will have its own performance issues.
> >> It wouldn't actually really fix anything, because the inode is so big
> >> and sparsely accessed that it doesn't even really help the cache
> >> density issue. Yeah, it gets rid of the pointer access, but that's
> >> pretty much it. The fact that we randomize the order means that we
> >> can't even really try to aim for any cache density.
> >>
> >> And it would actually not be possible with the current layered
> >> security model anyway, since those blob sizes are dynamic at runtime.
> >>
> >> If we had _only_ SELinux, we could perhaps have hidden the
> >> sid/sclass/task_sid directly in the inode (it would be only slightly
> >> larger than the pointer is, anyway), but even that ship sailed long
> >> long ago due to the whole "no security person can ever agree with
> >> another one on fundamentals".
> > Also there is bpf_lsm now that we're going to run it in production, 
> > so performance is as important as ever.
> > Traditional lsm-s have per-lsm per-inode blob.
> > For bpf that doesn't work, since progs come and go at run-time and
> > independent from each other.
> 
> The inode blob is for attributes associated with the filesystem object.
> There are cred and task blobs for program information.
> If you need separate per-task data you should put it there.

task_struct will have its own member for bpf local storage as I explained.
For most structs we won't be adding new member or even reusing lsm's security
blob pointer, because existing bpf hash table approach is fast enough and
flexible, since it can key based on anything. Any pointer and any struct can
have additional scratch area available to bpf progs. We're adding
*_local_storage only to kernel objects where performance is critical. Like
'struct sock'.

> > So we need per-program per-inode blob.
> > To maintain good performance we've proposed:
> > @@ -740,6 +741,10 @@  struct inode {
> >  	struct fsverity_info	*i_verity_info;
> >  #endif
> >  
> > +#ifdef CONFIG_BPF_SYSCALL
> > +	struct bpf_local_storage __rcu	*inode_bpf_storage;
> > +#endif
> >
> > https://patchwork.ozlabs.org/project/netdev/patch/20200526163336.63653-3-kpsingh@chromium.org/
> >
> > but got pushback, so we're going to use lsm style for now:
> > +static inline struct bpf_lsm_storage *bpf_inode(const struct inode *inode)
> > +{
> > +       if (unlikely(!inode->i_security))
> > +               return NULL;
> > +
> > +       return inode->i_security + bpf_lsm_blob_sizes.lbs_inode;
> > +}
> >
> > which means extra kmalloc for every inode, extra pointer deref, global var
> > access, and additional math just to get to 'inode_bpf_storage' pointer.
> >
> > We have similar pointer in 'struct sock' already:
> > #ifdef CONFIG_BPF_SYSCALL
> >         struct bpf_sk_storage __rcu     *sk_bpf_storage;
> > #endif
> > that is used by variety of networking bpf programs.
> > The commit 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage")
> > has benchmarking data for it:
> >   hash table with 8-byte key -> 152ns per bpf run 
> >   sk_bpf_storage -> 66ns per bpf run
> > Hashtable suppose to be O(1) with L1$ hit, but it turned out
> > to be noticeably slower than sk_bpf_storage.
> > We expect to see similar performance gains for inode_bpf_storage
> > vs hashtable approach that people use now.
> > Eventually we'll add task_bpf_storage as well.
> > Right now every other bpf tracing script is using pid as a key
> > in a separate hash table to store per-task data. For high frequency
> > events that adds up. task_bpf_storage will accelerate that.
> 
> Why aren't you using a task blob? We have support for what
> you need. 
> 
> > Another way to look at it is shared inode->i_security across
> > different inodes won't work for us. We need something really
> > cheap like single 'inode_bpf_storage' pointer that is zero
> > most of the time and for few inodes bpf progs will keep their
> > scratch data in there.
> > For now lsm style bpf_inode() approach is ok-ish.
> > But we will come back when we collect perf numbers to justify
> > why direct pointer in the 'struct inode' is a win.
> 
> It would be really helpful if instead of "the regular mechanism
> is too slow, so we abandoned it" we could do "the regular mechanism
> was to slow, so we made it better". I appreciate that every bit of
> performance matters. That's true for all of the security modules,
> which is why I object to special case tricks to boost the performance
> of one module at the expense of everything else.

It's not about lsm and not about bpf_lsm.
The *_local_storage infrastructure is generic and available to
bpf program types where it can be done safely.
Like sk_local_storage is avaiable to ~25% of networking bpf prog types.
(not to 100% because bpf progs run in a many different contexts
and in some sock pointer is not accessibly safely).
Similarly inode_local_storage will be available to different types
of progs. Including tracing progs where it's safe.

To rephrase bpf's *_local_storage could have been used by selinux
and other lsm instead of current infra and likely would have been
faster. In that sense selinux would be seen as another bpf program.
It's not a program, of course, but it could have accessed it through
the same mechanism.
So, no, we're not "abandoning existing mechanism because it's too slow"
we've created a new mechanism because it's generic (not related
to lsm-s and security) and made it faster for all, so other kernel
bits (including lsm-s) are welcome to use it.
James Morris June 4, 2020, 2:13 a.m. UTC | #12
On Wed, 3 Jun 2020, Casey Schaufler wrote:

> On 6/3/2020 3:12 PM, James Morris wrote:
> > On Wed, 3 Jun 2020, Casey Schaufler wrote:
> >
> >> The use of security modules was expected to be rare.
> > This is not correct. Capabilities were ported to LSM and stacked from the 
> > beginning, and several major distros worked on LSM so they could ship 
> > their own security modules.
> 
> Capabilities has always been a special case.
> Until Android adopted SELinux the actual use of LSMs was rare.

Nope, it was enabled by default in several distros and very widely 
deployed in the govt space (at least).
Stephen Smalley June 4, 2020, 12:45 p.m. UTC | #13
On Wed, Jun 3, 2020 at 6:39 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> On 6/3/2020 3:12 PM, James Morris wrote:
> > On Wed, 3 Jun 2020, Casey Schaufler wrote:
> >
> >> The use of security modules was expected to be rare.
> > This is not correct. Capabilities were ported to LSM and stacked from the
> > beginning, and several major distros worked on LSM so they could ship
> > their own security modules.
>
> Capabilities has always been a special case.
> Until Android adopted SELinux the actual use of LSMs was rare.

I don't think that is correct.  Fedora/RHEL were enabling SELinux by
default since around 2004/2005 and for a while Fedora was tracking
SELinux status as part of their "smolt" hardware profiling project and
SELinux enablement was trending above 80% IIRC before they
de-commissioned smolt. SuSE/SLES and Ubuntu were enabling AppArmor by
default for quite some time too prior to SE Android.  It is certainly
true that Android's adoption of SELinux massively increased the size
of the SELinux install base (and was the first to make SELinux usage
mandatory, not just default-enabled) but I don't think it is accurate
to say that LSM usage was rare prior to that.
Casey Schaufler June 4, 2020, 3:28 p.m. UTC | #14
On 6/4/2020 5:45 AM, Stephen Smalley wrote:
> On Wed, Jun 3, 2020 at 6:39 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 6/3/2020 3:12 PM, James Morris wrote:
>>> On Wed, 3 Jun 2020, Casey Schaufler wrote:
>>>
>>>> The use of security modules was expected to be rare.
>>> This is not correct. Capabilities were ported to LSM and stacked from the
>>> beginning, and several major distros worked on LSM so they could ship
>>> their own security modules.
>> Capabilities has always been a special case.
>> Until Android adopted SELinux the actual use of LSMs was rare.
> I don't think that is correct.  Fedora/RHEL were enabling SELinux by
> default since around 2004/2005 and for a while Fedora was tracking
> SELinux status as part of their "smolt" hardware profiling project and
> SELinux enablement was trending above 80% IIRC before they
> de-commissioned smolt. SuSE/SLES and Ubuntu were enabling AppArmor by
> default for quite some time too prior to SE Android.

POSIX ACLs have been enabled just as long. Their use is still
incredibly rare.

>   It is certainly
> true that Android's adoption of SELinux massively increased the size
> of the SELinux install base (and was the first to make SELinux usage
> mandatory, not just default-enabled) but I don't think it is accurate
> to say that LSM usage was rare prior to that.

That will depend on whether you consider presence to be usage.
That gets into the whole "transparent security" argument.
Sorry I brought this up. I don't mean to disrespect the achievement
of SELinux. My experience of the Orange Book and early Common
Criteria era, including the Unix to Linux transition, seems to
have differed somewhat from that others.
Stephen Smalley June 4, 2020, 3:33 p.m. UTC | #15
On Thu, Jun 4, 2020 at 11:28 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> On 6/4/2020 5:45 AM, Stephen Smalley wrote:
> > On Wed, Jun 3, 2020 at 6:39 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >> On 6/3/2020 3:12 PM, James Morris wrote:
> >>> On Wed, 3 Jun 2020, Casey Schaufler wrote:
> >>>
> >>>> The use of security modules was expected to be rare.
> >>> This is not correct. Capabilities were ported to LSM and stacked from the
> >>> beginning, and several major distros worked on LSM so they could ship
> >>> their own security modules.
> >> Capabilities has always been a special case.
> >> Until Android adopted SELinux the actual use of LSMs was rare.
> > I don't think that is correct.  Fedora/RHEL were enabling SELinux by
> > default since around 2004/2005 and for a while Fedora was tracking
> > SELinux status as part of their "smolt" hardware profiling project and
> > SELinux enablement was trending above 80% IIRC before they
> > de-commissioned smolt. SuSE/SLES and Ubuntu were enabling AppArmor by
> > default for quite some time too prior to SE Android.
>
> POSIX ACLs have been enabled just as long. Their use is still
> incredibly rare.
>
> >   It is certainly
> > true that Android's adoption of SELinux massively increased the size
> > of the SELinux install base (and was the first to make SELinux usage
> > mandatory, not just default-enabled) but I don't think it is accurate
> > to say that LSM usage was rare prior to that.
>
> That will depend on whether you consider presence to be usage.
> That gets into the whole "transparent security" argument.

The distros were shipping policies for their respective LSMs that
confined some subset of the processes, and userspace was leveraging
those LSMs (both to get/set labels and to get policy decisions for
userspace enforcers) well before Android adopted SELinux.  I think
that counts as usage.  If by usage you mean end users were writing
their own policies, that certainly is a more specialized class of
users but that's even less so in Android, where end users aren't
allowed to modify the policy at all.

> Sorry I brought this up. I don't mean to disrespect the achievement
> of SELinux. My experience of the Orange Book and early Common
> Criteria era, including the Unix to Linux transition, seems to
> have differed somewhat from that others.