diff mbox series

[RFC,2/9] audit,io_uring,io-wq: add some basic audit support to io_uring

Message ID 162163379461.8379.9691291608621179559.stgit@sifl (mailing list archive)
State New, archived
Headers show
Series Add LSM access controls and auditing to io_uring | expand

Commit Message

Paul Moore May 21, 2021, 9:49 p.m. UTC
WARNING - This is a work in progress and should not be merged
anywhere important.  It is almost surely not complete, and while it
probably compiles it likely hasn't been booted and will do terrible
things.  You have been warned.

This patch adds basic auditing to io_uring operations, regardless of
their context.  This is accomplished by allocating audit_context
structures for the io-wq worker and io_uring SQPOLL kernel threads
as well as explicitly auditing the io_uring operations in
io_issue_sqe().  The io_uring operations are audited using a new
AUDIT_URINGOP record; an example is shown below:

  % <TODO - insert AUDIT_URINGOP record example>

Thanks to Richard Guy Briggs for review and feedback.

Signed-off-by: Paul Moore <paul@paul-moore.com>
---
 fs/io-wq.c                 |    4 +
 fs/io_uring.c              |   11 +++
 include/linux/audit.h      |   17 ++++
 include/uapi/linux/audit.h |    1 +
 kernel/audit.h             |    2 +
 kernel/auditsc.c           |  173 ++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 208 insertions(+)
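
For reference, the io-wq/SQPOLL portion of the change described above is
conceptually just giving those kernel threads an audit_context when they
start, so that requests they execute can be audited like requests issued
from task context.  A minimal sketch of that idea (the helper name
audit_alloc_kernel() is an assumption here, not necessarily what the
patch adds):

  /* sketch only: io-wq worker / SQPOLL thread startup path */
  static int io_wqe_worker(void *data)
  {
  	/*
  	 * Give this kernel thread an audit_context so the io_uring
  	 * requests it runs can generate AUDIT_URINGOP records.
  	 */
  	if (audit_alloc_kernel(current))
  		return -ENOMEM;

  	/* ... normal worker loop ... */
  	return 0;
  }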

Comments

Pavel Begunkov May 22, 2021, 12:22 a.m. UTC | #1
On 5/21/21 10:49 PM, Paul Moore wrote:
> WARNING - This is a work in progress and should not be merged
> anywhere important.  It is almost surely not complete, and while it
> probably compiles it likely hasn't been booted and will do terrible
> things.  You have been warned.
> 
> This patch adds basic auditing to io_uring operations, regardless of
> their context.  This is accomplished by allocating audit_context
> structures for the io-wq worker and io_uring SQPOLL kernel threads
> as well as explicitly auditing the io_uring operations in
> io_issue_sqe().  The io_uring operations are audited using a new
> AUDIT_URINGOP record, an example is shown below:
> 
>   % <TODO - insert AUDIT_URINGOP record example>
> 
> Thanks to Richard Guy Briggs for review and feedback.
> 
> Signed-off-by: Paul Moore <paul@paul-moore.com>
> ---
[...]
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index e481ac8a757a..e9941d1ad8fd 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -78,6 +78,7 @@
>  #include <linux/task_work.h>
>  #include <linux/pagemap.h>
>  #include <linux/io_uring.h>
> +#include <linux/audit.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/io_uring.h>
> @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
>  	if (req->work.creds && req->work.creds != current_cred())
>  		creds = override_creds(req->work.creds);
>  
> +	if (req->opcode < IORING_OP_LAST)

always true at this point

> +		audit_uring_entry(req->opcode);

So, it adds two if's with memory loads (i.e. current->audit_context)
per request in one of the hottest functions here... No way, nack

Maybe, if it's dynamically compiled in, like kprobes, when it's
_really_ used.

> +
>  	switch (req->opcode) {
>  	case IORING_OP_NOP:
>  		ret = io_nop(req, issue_flags);
> @@ -6211,6 +6215,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
>  		break;
>  	}
>  
> +	if (req->opcode < IORING_OP_LAST)
> +		audit_uring_exit(!ret, ret);
> +
>  	if (creds)
>  		revert_creds(creds);
Paul Moore May 22, 2021, 2:36 a.m. UTC | #2
On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> On 5/21/21 10:49 PM, Paul Moore wrote:
> > WARNING - This is a work in progress and should not be merged
> > anywhere important.  It is almost surely not complete, and while it
> > probably compiles it likely hasn't been booted and will do terrible
> > things.  You have been warned.
> >
> > This patch adds basic auditing to io_uring operations, regardless of
> > their context.  This is accomplished by allocating audit_context
> > structures for the io-wq worker and io_uring SQPOLL kernel threads
> > as well as explicitly auditing the io_uring operations in
> > io_issue_sqe().  The io_uring operations are audited using a new
> > AUDIT_URINGOP record, an example is shown below:
> >
> >   % <TODO - insert AUDIT_URINGOP record example>
> >
> > Thanks to Richard Guy Briggs for review and feedback.
> >
> > Signed-off-by: Paul Moore <paul@paul-moore.com>
> > ---
> [...]
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index e481ac8a757a..e9941d1ad8fd 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -78,6 +78,7 @@
> >  #include <linux/task_work.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/io_uring.h>
> > +#include <linux/audit.h>
> >
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/io_uring.h>
> > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
> >       if (req->work.creds && req->work.creds != current_cred())
> >               creds = override_creds(req->work.creds);
> >
> > +     if (req->opcode < IORING_OP_LAST)
>
> always true at this point

I placed the opcode check before the audit call because the switch
statement below which handles the operation dispatching has a 'ret =
-EINVAL' for the default case, implying that there are some paths
where an invalid opcode could be passed into the function.  Obviously
if that is not the case and you can guarantee that req->opcode will
always be valid we can easily drop the check prior to the audit call.

> > +             audit_uring_entry(req->opcode);
>
> So, it adds two if's with memory loads (i.e. current->audit_context)
> per request in one of the hottest functions here... No way, nack
>
> Maybe, if it's dynamically compiled into like kprobes if it's
> _really_ used.

I'm open to suggestions on how to tweak the io_uring/audit
integration; if you don't like what I've proposed in this patchset,
let's try to come up with a solution that is more palatable.  If you
were going to add audit support for these io_uring operations, how
would you propose we do it?  Not being able to properly audit io_uring
operations is going to be a significant issue for a chunk of users, if
it isn't already; we need to work to find a solution to this problem.

Unfortunately I don't think dynamically inserting audit calls is
something that would meet the needs of the audit community (I fear it
would run afoul of the various security certifications), and it
definitely isn't something that we support at present.
Pavel Begunkov May 23, 2021, 8:26 p.m. UTC | #3
On 5/22/21 3:36 AM, Paul Moore wrote:
> On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>> On 5/21/21 10:49 PM, Paul Moore wrote:
[...]
>>>
>>> +     if (req->opcode < IORING_OP_LAST)
>>
>> always true at this point
> 
> I placed the opcode check before the audit call because the switch
> statement below which handles the operation dispatching has a 'ret =
> -EINVAL' for the default case, implying that there are some paths
> where an invalid opcode could be passed into the function.  Obviously
> if that is not the case and you can guarantee that req->opcode will
> always be valid we can easily drop the check prior to the audit call.

It is always true at this point, would be completely broken
otherwise

>>> +             audit_uring_entry(req->opcode);
>>
>> So, it adds two if's with memory loads (i.e. current->audit_context)
>> per request in one of the hottest functions here... No way, nack
>>
>> Maybe, if it's dynamically compiled into like kprobes if it's
>> _really_ used.
> 
> I'm open to suggestions on how to tweak the io_uring/audit
> integration, if you don't like what I've proposed in this patchset,
> lets try to come up with a solution that is more palatable.  If you
> were going to add audit support for these io_uring operations, how
> would you propose we do it?  Not being able to properly audit io_uring
> operations is going to be a significant issue for a chunk of users, if
> it isn't already, we need to work to find a solution to this problem.

Who knows. First of all, it seems CONFIG_AUDIT is enabled by default
for many popular distributions, so I assume that is not compiled out.

What are the use cases for audit? Always running, I guess? Putting aside
compatibility problems, it sounds like with the amount of overhead
it adds there is not much profit in using io_uring in the first place.
Is that so?

__audit_uring_exit()
-> audit_filter_syscall()
  -> for (audit_list) if (...) audit_filter_rules()
    -> ...
-> audit_filter_inodes()
  -> ...

> Unfortunately I don't think dynamically inserting audit calls is
> something that would meet the needs of the audit community (I fear it
> would run afoul of the various security certifications), and it
> definitely isn't something that we support at present.

I see
Paul Moore May 24, 2021, 7:59 p.m. UTC | #4
On Sun, May 23, 2021 at 4:26 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> On 5/22/21 3:36 AM, Paul Moore wrote:
> > On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >> On 5/21/21 10:49 PM, Paul Moore wrote:
> [...]
> >>>
> >>> +     if (req->opcode < IORING_OP_LAST)
> >>
> >> always true at this point
> >
> > I placed the opcode check before the audit call because the switch
> > statement below which handles the operation dispatching has a 'ret =
> > -EINVAL' for the default case, implying that there are some paths
> > where an invalid opcode could be passed into the function.  Obviously
> > if that is not the case and you can guarantee that req->opcode will
> > always be valid we can easily drop the check prior to the audit call.
>
> It is always true at this point, would be completely broken
> otherwise

Understood, I was just pointing out an oddity in the code.  I just
dropped the checks from my local tree; you'll see it in the next draft
of the patchset.

> >> So, it adds two if's with memory loads (i.e. current->audit_context)
> >> per request in one of the hottest functions here... No way, nack
> >>
> >> Maybe, if it's dynamically compiled into like kprobes if it's
> >> _really_ used.
> >
> > I'm open to suggestions on how to tweak the io_uring/audit
> > integration, if you don't like what I've proposed in this patchset,
> > lets try to come up with a solution that is more palatable.  If you
> > were going to add audit support for these io_uring operations, how
> > would you propose we do it?  Not being able to properly audit io_uring
> > operations is going to be a significant issue for a chunk of users, if
> > it isn't already, we need to work to find a solution to this problem.
>
> Who knows. First of all, seems CONFIG_AUDIT is enabled by default
> for many popular distributions, so I assume that is not compiled out.
>
> What are use cases for audit? Always running I guess?

Audit has been around for quite some time now, and its goal is to
provide a mechanism for logging "security relevant" information in
such a way that it meets the needs of various security certification
efforts.  Traditional Linux event logging, e.g. syslog and the like,
does not meet these requirements and changing them would likely affect
the usability for those who are not required to be compliant with
these security certifications.  The Linux audit subsystem allows Linux
to be used in places it couldn't be used otherwise (or rather makes it
a *lot* easier).

That said, audit is not for everyone, and we have build time and
runtime options to help make life easier.  Beyond simply disabling
audit at compile time a number of Linux distributions effectively
shortcut audit at runtime by adding a "never" rule to the audit
filter, for example:

 % auditctl -a task,never

> Putting aside compatibility problems, it sounds that with the amount of overhead
> it adds there is no much profit in using io_uring in the first place.
> Is that so?

Well, if audit alone erased all of the io_uring advantages we should
just rip io_uring out of the kernel and people can just disable audit
instead ;)

I believe there are people who would like to use io_uring and are also
required to use a kernel with audit, either due to the need to run a
distribution kernel or the need to capture security information in the
audit stream.  I'm hoping that we can find a solution for these users;
if we don't we are asking this group to choose either io_uring or
audit, and that is something I would like to avoid.
Pavel Begunkov May 25, 2021, 8:27 a.m. UTC | #5
On 5/24/21 8:59 PM, Paul Moore wrote:
> On Sun, May 23, 2021 at 4:26 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>> On 5/22/21 3:36 AM, Paul Moore wrote:
>>> On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>> On 5/21/21 10:49 PM, Paul Moore wrote:
>> [...]
>>>>>
>>>>> +     if (req->opcode < IORING_OP_LAST)
>>>>
>>>> always true at this point
>>>
>>> I placed the opcode check before the audit call because the switch
>>> statement below which handles the operation dispatching has a 'ret =
>>> -EINVAL' for the default case, implying that there are some paths
>>> where an invalid opcode could be passed into the function.  Obviously
>>> if that is not the case and you can guarantee that req->opcode will
>>> always be valid we can easily drop the check prior to the audit call.
>>
>> It is always true at this point, would be completely broken
>> otherwise
> 
> Understood, I was just pointing out an oddity in the code.  I just
> dropped the checks from my local tree, you'll see it in the next draft
> of the patchset.
> 
>>>> So, it adds two if's with memory loads (i.e. current->audit_context)
>>>> per request in one of the hottest functions here... No way, nack
>>>>
>>>> Maybe, if it's dynamically compiled into like kprobes if it's
>>>> _really_ used.
>>>
>>> I'm open to suggestions on how to tweak the io_uring/audit
>>> integration, if you don't like what I've proposed in this patchset,
>>> lets try to come up with a solution that is more palatable.  If you
>>> were going to add audit support for these io_uring operations, how
>>> would you propose we do it?  Not being able to properly audit io_uring
>>> operations is going to be a significant issue for a chunk of users, if
>>> it isn't already, we need to work to find a solution to this problem.
>>
>> Who knows. First of all, seems CONFIG_AUDIT is enabled by default
>> for many popular distributions, so I assume that is not compiled out.
>>
>> What are use cases for audit? Always running I guess?
> 
> Audit has been around for quite some time now, and it's goal is to
> provide a mechanism for logging "security relevant" information in
> such a way that it meets the needs of various security certification
> efforts.  Traditional Linux event logging, e.g. syslog and the like,
> does not meet these requirements and changing them would likely affect
> the usability for those who are not required to be compliant with
> these security certifications.  The Linux audit subsystem allows Linux
> to be used in places it couldn't be used otherwise (or rather makes it
> a *lot* easier).
> 
> That said, audit is not for everyone, and we have build time and
> runtime options to help make life easier.  Beyond simply disabling
> audit at compile time a number of Linux distributions effectively
> shortcut audit at runtime by adding a "never" rule to the audit
> filter, for example:
> 
>  % auditctl -a task,never
> 
>> Putting aside compatibility problems, it sounds that with the amount of overhead
>> it adds there is no much profit in using io_uring in the first place.
>> Is that so?
> 
> Well, if audit alone erased all of the io_uring advantages we should
> just rip io_uring out of the kernel and people can just disable audit
> instead ;)


Hah, if we added a simple idle "feature" like

for (i = 0; i < 1000000; i++) {;}

and it destroyed all the performance, then let's throw away the
useless Linux kernel!

Seriously though, it's more of an open question to me how much overhead
it adds if enabled (with typical filters/options/etc.).

Btw, do you really need two hooks -- before and right after
execution?

> I believe there are people who would like to use io_uring and are also
> required to use a kernel with audit, either due to the need to run a
> distribution kernel or the need to capture security information in the
> audit stream.  I'm hoping that we can find a solution for these users;
> if we don't we are asking this group to choose either io_uring or
> audit, and that is something I would like to avoid.
Paul Moore May 25, 2021, 2:53 p.m. UTC | #6
On Tue, May 25, 2021 at 4:27 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> On 5/24/21 8:59 PM, Paul Moore wrote:
> > On Sun, May 23, 2021 at 4:26 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >> On 5/22/21 3:36 AM, Paul Moore wrote:
> >>> On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>>> On 5/21/21 10:49 PM, Paul Moore wrote:
> >> [...]
> >>>>>
> >>>>> +     if (req->opcode < IORING_OP_LAST)
> >>>>
> >>>> always true at this point
> >>>
> >>> I placed the opcode check before the audit call because the switch
> >>> statement below which handles the operation dispatching has a 'ret =
> >>> -EINVAL' for the default case, implying that there are some paths
> >>> where an invalid opcode could be passed into the function.  Obviously
> >>> if that is not the case and you can guarantee that req->opcode will
> >>> always be valid we can easily drop the check prior to the audit call.
> >>
> >> It is always true at this point, would be completely broken
> >> otherwise
> >
> > Understood, I was just pointing out an oddity in the code.  I just
> > dropped the checks from my local tree, you'll see it in the next draft
> > of the patchset.
> >
> >>>> So, it adds two if's with memory loads (i.e. current->audit_context)
> >>>> per request in one of the hottest functions here... No way, nack
> >>>>
> >>>> Maybe, if it's dynamically compiled into like kprobes if it's
> >>>> _really_ used.
> >>>
> >>> I'm open to suggestions on how to tweak the io_uring/audit
> >>> integration, if you don't like what I've proposed in this patchset,
> >>> lets try to come up with a solution that is more palatable.  If you
> >>> were going to add audit support for these io_uring operations, how
> >>> would you propose we do it?  Not being able to properly audit io_uring
> >>> operations is going to be a significant issue for a chunk of users, if
> >>> it isn't already, we need to work to find a solution to this problem.
> >>
> >> Who knows. First of all, seems CONFIG_AUDIT is enabled by default
> >> for many popular distributions, so I assume that is not compiled out.
> >>
> >> What are use cases for audit? Always running I guess?
> >
> > Audit has been around for quite some time now, and it's goal is to
> > provide a mechanism for logging "security relevant" information in
> > such a way that it meets the needs of various security certification
> > efforts.  Traditional Linux event logging, e.g. syslog and the like,
> > does not meet these requirements and changing them would likely affect
> > the usability for those who are not required to be compliant with
> > these security certifications.  The Linux audit subsystem allows Linux
> > to be used in places it couldn't be used otherwise (or rather makes it
> > a *lot* easier).
> >
> > That said, audit is not for everyone, and we have build time and
> > runtime options to help make life easier.  Beyond simply disabling
> > audit at compile time a number of Linux distributions effectively
> > shortcut audit at runtime by adding a "never" rule to the audit
> > filter, for example:
> >
> >  % auditctl -a task,never
> >
> >> Putting aside compatibility problems, it sounds that with the amount of overhead
> >> it adds there is no much profit in using io_uring in the first place.
> >> Is that so?
> >
> > Well, if audit alone erased all of the io_uring advantages we should
> > just rip io_uring out of the kernel and people can just disable audit
> > instead ;)
>
> Hah, if we add a simple idle "feature" like
>
> for (i=0;i<1000000;i+) {;}
>
> and it would destroy all the performance, let's throw useless
> Linux kernel then!
>
> If seriously, it's more of an open question to me, how much overhead
> it adds if enabled (with typical filters/options/etc).

I am very hesitant to define a "typical" audit configuration as it
really is dependent on the user's security requirements as well as the
particular solution/environment.  Any configuration I pick will be
"wrong" for a lot of audit users :)

As I see it, users will likely find themselves in one of three
performance buckets:

* io_uring enabled, CONFIG_AUDIT=n

For those who are already trying to get that last 1% of performance
from their kernel and are willing to give up most everything else to
get it, this is the obvious choice.  Needless to say there should not
be any audit-related impact in this case (especially since we've
removed the req->opcode checks prior to the audit calls).

* io_uring enabled, CONFIG_AUDIT=y, audit "task,never" runtime config

[side note: I made some tweaks to audit_uring_entry() to move the
audit_enabled check there so we no longer need to call into
__audit_uring_entry() in this fastpath case.  Similarly
audit_uring_exit() now does an audit_dummy_context() check instead of
simply checking audit_context(); this should avoid calling into
__audit_uring_exit() when the io_uring op is not being audited.]

I'm guessing that most distro users will fall into this bucket.  Here
the task's audit_context should always be NULL, both in the syscall
context and sqpoll context, so io_uring would call into the inline
audit_uring_entry() function and return without calling into
__audit_uring_entry() (see the "side note" above).  The
audit_uring_exit() call would behave in a similar manner.

* io_uring enabled, CONFIG_AUDIT=y, some sort of runtime audit config

For obvious reasons this case has the most performance impact and as
mentioned above it can vary quite a bit.  In my opinion this is the
least important bucket from a performance perspective as the user
clearly has a need for the security information that audit provides
and enabling that functionality in io_uring is critical to allowing
the user to take advantage of io_uring.  The performance of io_uring
is impacted, but it should still be more performant than synchronous
I/O and it allows the user to run their existing io_uring
applications/code-paths.

> Btw, do you really need two hooks -- before and right after
> execution?

Yes, the entry/before hook does the setup for the io_uring op (very
similar to a syscall entry point) and the exit/after hook is what does
the audit record generation based on what took place during the
io_uring operation.
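
Roughly speaking, the two hooks split the work like this (a conceptual
sketch only; the field and helper names below are placeholders for
illustration, not the actual patch code):

  void __audit_uring_entry(u8 op)
  {
  	struct audit_context *ctx = audit_context();

  	/* setup, like syscall entry: remember the op so any records
  	 * collected while it runs are attached to this operation */
  	ctx->uring_op = op;			/* placeholder field */
  }

  void __audit_uring_exit(int success, long code)
  {
  	struct audit_context *ctx = audit_context();

  	/* record generation, like syscall exit: run the audit filters
  	 * and, if the op is of interest, emit an AUDIT_URINGOP record
  	 * with the opcode and result, then reset the context */
  	if (audit_filter_uring(current, ctx))		/* placeholder */
  		audit_log_uring(ctx, success, code);	/* placeholder */
  	audit_reset_context(ctx);			/* placeholder */
  }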

> > I believe there are people who would like to use io_uring and are also
> > required to use a kernel with audit, either due to the need to run a
> > distribution kernel or the need to capture security information in the
> > audit stream.  I'm hoping that we can find a solution for these users;
> > if we don't we are asking this group to choose either io_uring or
> > audit, and that is something I would like to avoid.

I'm leaving this old paragraph here because I really think this is
important to the discussion.  If we can't find a solution here we
would need to make io_uring and audit mutually exclusive and I don't
think that is in the best interests of the users, and would surely
create a headache for the distros.
Jens Axboe May 26, 2021, 1:11 a.m. UTC | #7
On 5/24/21 1:59 PM, Paul Moore wrote:
> That said, audit is not for everyone, and we have build time and
> runtime options to help make life easier.  Beyond simply disabling
> audit at compile time a number of Linux distributions effectively
> shortcut audit at runtime by adding a "never" rule to the audit
> filter, for example:
> 
>  % auditctl -a task,never

As has been brought up, the issue we're facing is that distros have
CONFIG_AUDIT=y and hence the above is the best real world case outside
of people doing custom kernels. My question would then be how much
overhead the above will add, considering it's an entry/exit call per op.
If auditctl is turned off, what is the expectation in terms of overhead?

My gut feeling tells me it's likely going to be too much. Keep in mind
that we're sometimes doing millions of operations per second, per core.

aio never had any audit logging as far as I can tell. I think it'd make
a lot more sense to selectively enable audit logging only for opcodes
that we care about. File open/create/unlink/mkdir etc, that kind of
thing. File level operations that people would care about logging. Would
they care about logging a buffer registration or a polled read from a
device/file? I highly doubt it, and we don't do that for alternative
methods either. Doesn't really make sense for a lot of the other
operations, imho.
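
A rough sketch of what that could look like (illustrative only; the
opcode list below is a guess at what might be considered relevant, not
a vetted set):

  /* illustrative only: mark the opcodes that would go through audit */
  static const bool io_op_audited[IORING_OP_LAST] = {
  	[IORING_OP_OPENAT]	= true,
  	[IORING_OP_OPENAT2]	= true,
  	[IORING_OP_UNLINKAT]	= true,
  	[IORING_OP_RENAMEAT]	= true,
  	[IORING_OP_CONNECT]	= true,
  	[IORING_OP_ACCEPT]	= true,
  };

  /* in io_issue_sqe(), instead of unconditional audit calls */
  if (io_op_audited[req->opcode])
  	audit_uring_entry(req->opcode);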
Paul Moore May 26, 2021, 2:04 a.m. UTC | #8
On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 5/24/21 1:59 PM, Paul Moore wrote:
> > That said, audit is not for everyone, and we have build time and
> > runtime options to help make life easier.  Beyond simply disabling
> > audit at compile time a number of Linux distributions effectively
> > shortcut audit at runtime by adding a "never" rule to the audit
> > filter, for example:
> >
> >  % auditctl -a task,never
>
> As has been brought up, the issue we're facing is that distros have
> CONFIG_AUDIT=y and hence the above is the best real world case outside
> of people doing custom kernels. My question would then be how much
> overhead the above will add, considering it's an entry/exit call per op.
> If auditctl is turned off, what is the expectation in turns of overhead?

I commented on that case in my last email to Pavel, but I'll try to go
over it again in a little more detail.

As we discussed earlier in this thread, we can skip the req->opcode
check before both the _entry and _exit calls, so we are left with just
the bare audit calls in the io_uring code.  As the _entry and _exit
functions are small, I've copied them and their supporting functions
below and I'll try to explain what would happen in CONFIG_AUDIT=y,
"task,never" case.

+  static inline struct audit_context *audit_context(void)
+  {
+    return current->audit_context;
+  }

+  static inline bool audit_dummy_context(void)
+  {
+    void *p = audit_context();
+    return !p || *(int *)p;
+  }

+  static inline void audit_uring_entry(u8 op)
+  {
+    if (unlikely(audit_enabled && audit_context()))
+      __audit_uring_entry(op);
+  }

We have one if statement where the conditional checks on two
individual conditions.  The first (audit_enabled) is simply a check to
see if anyone has "turned on" auditing at runtime; historically this
worked rather well, and still does in a number of places, but ever
since systemd has taken to forcing audit on regardless of the admin's
audit configuration it is less useful.  The second (audit_context())
is a check to see if an audit_context has been allocated for the
current task.  In the case of "task,never" current->audit_context will
be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
will never be called.

Worst case here is checking the value of audit_enabled and
current->audit_context.  Depending on which you think is more likely
we can change the order of the check so that the
current->audit_context check is first if you feel that is more likely
to be NULL than audit_enabled is to be false (it may be that way now).
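
For illustration, the reordered variant would simply swap the two tests
(a sketch, not what the patch currently does):

  static inline void audit_uring_entry(u8 op)
  {
    /* test current->audit_context first if it is more likely NULL */
    if (unlikely(audit_context() && audit_enabled))
      __audit_uring_entry(op);
  }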

+  static inline void audit_uring_exit(int success, long code)
+  {
+    if (unlikely(!audit_dummy_context()))
+      __audit_uring_exit(success, code);
+  }

The exit call is very similar to the entry call, but in the
"task,never" case it is very simple as the first check to be performed
is the current->audit_context check which we know to be NULL.  The
__audit_uring_exit() slowpath will never be called.

> aio never had any audit logging as far as I can tell. I think it'd make
> a lot more sense to selectively enable audit logging only for opcodes
> that we care about. File open/create/unlink/mkdir etc, that kind of
> thing. File level operations that people would care about logging. Would
> they care about logging a buffer registration or a polled read from a
> device/file? I highly doubt it, and we don't do that for alternative
> methods either. Doesn't really make sense for a lot of the other
> operations, imho.

We would need to check with the current security requirements (there
are distro people on the linux-audit list that keep track of that
stuff), but looking at the opcodes right now my gut feeling is that
most of the opcodes would be considered "security relevant" so
selective auditing might not be that useful in practice.  It would
definitely clutter the code and increase the chances that new opcodes
would not be properly audited when they are merged.
Pavel Begunkov May 26, 2021, 10:19 a.m. UTC | #9
On 5/26/21 3:04 AM, Paul Moore wrote:
> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>> That said, audit is not for everyone, and we have build time and
>>> runtime options to help make life easier.  Beyond simply disabling
>>> audit at compile time a number of Linux distributions effectively
>>> shortcut audit at runtime by adding a "never" rule to the audit
>>> filter, for example:
>>>
>>>  % auditctl -a task,never
>>
>> As has been brought up, the issue we're facing is that distros have
>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>> of people doing custom kernels. My question would then be how much
>> overhead the above will add, considering it's an entry/exit call per op.
>> If auditctl is turned off, what is the expectation in turns of overhead?
> 
> I commented on that case in my last email to Pavel, but I'll try to go
> over it again in a little more detail.
> 
> As we discussed earlier in this thread, we can skip the req->opcode
> check before both the _entry and _exit calls, so we are left with just
> the bare audit calls in the io_uring code.  As the _entry and _exit
> functions are small, I've copied them and their supporting functions
> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> "task,never" case.
> 
> +  static inline struct audit_context *audit_context(void)
> +  {
> +    return current->audit_context;
> +  }
> 
> +  static inline bool audit_dummy_context(void)
> +  {
> +    void *p = audit_context();
> +    return !p || *(int *)p;
> +  }
> 
> +  static inline void audit_uring_entry(u8 op)
> +  {
> +    if (unlikely(audit_enabled && audit_context()))
> +      __audit_uring_entry(op);
> +  }

I'd rather agree that it's my cycle-picking. The case I care about
is CONFIG_AUDIT=y (because everybody enables it), and io_uring
auditing _not_ enabled at runtime. If it's enabled, let them suffer
the overhead; it will probably drag down the performance.

So, for the case I care about it's two of

if (unlikely(audit_enabled && current->audit_context))

in the hot path: a load-test-jump plus the current dereference, so it
will be around 7x2 instructions. We can throw away audit_enabled since,
as you say, systemd already enables it; that will give
4x2 instructions, including 2 conditional jumps.

That's not great at all. And that's why I brought up
the question about the need for pre and post hooks and whether
they can be combined. That would be just 4 instructions, and that is
ok(ish).

> We would need to check with the current security requirements (there
> are distro people on the linux-audit list that keep track of that
> stuff), but looking at the opcodes right now my gut feeling is that
> most of the opcodes would be considered "security relevant" so
> selective auditing might not be that useful in practice.  It would
> definitely clutter the code and increase the chances that new opcodes
> would not be properly audited when they are merged.

I'm curious why it's enabled by default by many distros. What use
cases do they have? Tempting to add an AUDIT_IOURING option defaulting
to N, but that won't work I guess.
Paul Moore May 26, 2021, 2:38 p.m. UTC | #10
On Wed, May 26, 2021 at 6:19 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> On 5/26/21 3:04 AM, Paul Moore wrote:
> > On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>> That said, audit is not for everyone, and we have build time and
> >>> runtime options to help make life easier.  Beyond simply disabling
> >>> audit at compile time a number of Linux distributions effectively
> >>> shortcut audit at runtime by adding a "never" rule to the audit
> >>> filter, for example:
> >>>
> >>>  % auditctl -a task,never
> >>
> >> As has been brought up, the issue we're facing is that distros have
> >> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >> of people doing custom kernels. My question would then be how much
> >> overhead the above will add, considering it's an entry/exit call per op.
> >> If auditctl is turned off, what is the expectation in turns of overhead?
> >
> > I commented on that case in my last email to Pavel, but I'll try to go
> > over it again in a little more detail.
> >
> > As we discussed earlier in this thread, we can skip the req->opcode
> > check before both the _entry and _exit calls, so we are left with just
> > the bare audit calls in the io_uring code.  As the _entry and _exit
> > functions are small, I've copied them and their supporting functions
> > below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> > "task,never" case.
> >
> > +  static inline struct audit_context *audit_context(void)
> > +  {
> > +    return current->audit_context;
> > +  }
> >
> > +  static inline bool audit_dummy_context(void)
> > +  {
> > +    void *p = audit_context();
> > +    return !p || *(int *)p;
> > +  }
> >
> > +  static inline void audit_uring_entry(u8 op)
> > +  {
> > +    if (unlikely(audit_enabled && audit_context()))
> > +      __audit_uring_entry(op);
> > +  }
>
> I'd rather agree that it's my cycle-picking. The case I care about
> is CONFIG_AUDIT=y (because everybody enable it), and io_uring
> tracing _not_ enabled at runtime. If enabled let them suffer
> the overhead, it will probably dip down the performance
>
> So, for the case I care about it's two of
>
> if (unlikely(audit_enabled && current->audit_context))
>
> in the hot path. load-test-jump + current, so it will
> be around 7x2 instructions. We can throw away audit_enabled
> as you say systemd already enables it, that will give
> 4x2 instructions including 2 conditional jumps.

We've basically got it down to the equivalent of two
"current->audit_context != NULL" checks in the case where audit is
built into the kernel but disabled at runtime, e.g. CONFIG_AUDIT=y and
"task,never".  I'm at a loss for how we can lower the overhead any
further, but I'm open to suggestions.

> That's not great at all. And that's why I brought up
> the question about need of pre and post hooks and whether
> can be combined. Would be just 4 instructions and that is
> ok (ish).

As discussed previously in this thread that isn't really an option
from an audit perspective.

> > We would need to check with the current security requirements (there
> > are distro people on the linux-audit list that keep track of that
> > stuff), but looking at the opcodes right now my gut feeling is that
> > most of the opcodes would be considered "security relevant" so
> > selective auditing might not be that useful in practice.  It would
> > definitely clutter the code and increase the chances that new opcodes
> > would not be properly audited when they are merged.
>
> I'm curious, why it's enabled by many distros by default? Are there
> use cases they use?

We've already talked about certain users and environments where audit
is an important requirement, e.g. public sector, health care,
financial institutions, etc.; without audit Linux wouldn't be an
option for these users, at least not without heavy modification,
out-of-tree/ISV patches, etc.  I currently don't have any direct ties
to any distros, "Enterprise" or otherwise, but in the past it has been
my experience that distros much prefer to have a single kernel build
to address the needs of all their users.  In the few cases I have seen
where a second kernel build is supported it is usually for hardware
enablement.  I'm sure there are other cases too, I just haven't seen
them personally; the big distros definitely seem to have a strong
desire to limit the number of supported kernel configs/builds.

> Tempting to add AUDIT_IOURING=default N, but won't work I guess

One of the nice things about audit is that it can give you a history
of what a user did on a system, which is very important for a number
of use cases.  If we selectively disable audit for certain subsystems
we create a blind spot in the audit log, and in the case of io_uring
this can be a very serious blind spot.  I fear that if we can't come
to some agreement here we will need to make io_uring and audit
mutually exclusive at build time which would be awful; forcing many
distros to either make a hard choice or carry out-of-tree patches.
Steve Grubb May 26, 2021, 3:11 p.m. UTC | #11
On Wednesday, May 26, 2021 10:38:38 AM EDT Paul Moore wrote:
> > > We would need to check with the current security requirements (there
> > > are distro people on the linux-audit list that keep track of that
> > > stuff),

The requirements generally care about resource access. File open, connect, 
accept, etc. We don't care about read/write itself as that would flood the 
analysis.
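
For example, a typical resource-access rule along those lines might look
like:

 % auditctl -a always,exit -F arch=b64 -S openat,connect,accept -k resource-access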

> > > but looking at the opcodes right now my gut feeling is that
> > > most of the opcodes would be considered "security relevant" so
> > > selective auditing might not be that useful in practice. 

I'd say maybe a quarter to a third look interesting.

> > > It would
> > > definitely clutter the code and increase the chances that new opcodes
> > > would not be properly audited when they are merged.

There is that...

> > I'm curious, why it's enabled by many distros by default? Are there
> > use cases they use?
> 
> We've already talked about certain users and environments where audit
> is an important requirement, e.g. public sector, health care,
> financial institutions, etc.; without audit Linux wouldn't be an
> option for these users,

People that care about auditing are under regulatory mandates. They care more 
about the audit event than the performance. Imagine you have a system with 
some brand new medical discovery. You want to know about anyone who accesses the 
information in case it gets leaked out. You don't care how slow the system 
gets - you simply *have* to know everyone who's looked at the documents.

-Steve
Stefan Metzmacher May 26, 2021, 3:17 p.m. UTC | #12
Am 26.05.21 um 16:38 schrieb Paul Moore:
> On Wed, May 26, 2021 at 6:19 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>> On 5/26/21 3:04 AM, Paul Moore wrote:
>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>>> That said, audit is not for everyone, and we have build time and
>>>>> runtime options to help make life easier.  Beyond simply disabling
>>>>> audit at compile time a number of Linux distributions effectively
>>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>>> filter, for example:
>>>>>
>>>>>  % auditctl -a task,never
>>>>
>>>> As has been brought up, the issue we're facing is that distros have
>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>>> of people doing custom kernels. My question would then be how much
>>>> overhead the above will add, considering it's an entry/exit call per op.
>>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>>
>>> I commented on that case in my last email to Pavel, but I'll try to go
>>> over it again in a little more detail.
>>>
>>> As we discussed earlier in this thread, we can skip the req->opcode
>>> check before both the _entry and _exit calls, so we are left with just
>>> the bare audit calls in the io_uring code.  As the _entry and _exit
>>> functions are small, I've copied them and their supporting functions
>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>>> "task,never" case.
>>>
>>> +  static inline struct audit_context *audit_context(void)
>>> +  {
>>> +    return current->audit_context;
>>> +  }
>>>
>>> +  static inline bool audit_dummy_context(void)
>>> +  {
>>> +    void *p = audit_context();
>>> +    return !p || *(int *)p;
>>> +  }
>>>
>>> +  static inline void audit_uring_entry(u8 op)
>>> +  {
>>> +    if (unlikely(audit_enabled && audit_context()))
>>> +      __audit_uring_entry(op);
>>> +  }
>>
>> I'd rather agree that it's my cycle-picking. The case I care about
>> is CONFIG_AUDIT=y (because everybody enable it), and io_uring
>> tracing _not_ enabled at runtime. If enabled let them suffer
>> the overhead, it will probably dip down the performance
>>
>> So, for the case I care about it's two of
>>
>> if (unlikely(audit_enabled && current->audit_context))
>>
>> in the hot path. load-test-jump + current, so it will
>> be around 7x2 instructions. We can throw away audit_enabled
>> as you say systemd already enables it, that will give
>> 4x2 instructions including 2 conditional jumps.
> 
> We've basically got it down to the equivalent of two
> "current->audit_context != NULL" checks in the case where audit is
> built into the kernel but disabled at runtime, e.g. CONFIG_AUDIT=y and
> "task,never".  I'm at a loss for how we can lower the overhead any
> further, but I'm open to suggestions.
> 
>> That's not great at all. And that's why I brought up
>> the question about need of pre and post hooks and whether
>> can be combined. Would be just 4 instructions and that is
>> ok (ish).
> 
> As discussed previously in this thread that isn't really an option
> from an audit perspective.
> 
>>> We would need to check with the current security requirements (there
>>> are distro people on the linux-audit list that keep track of that
>>> stuff), but looking at the opcodes right now my gut feeling is that
>>> most of the opcodes would be considered "security relevant" so
>>> selective auditing might not be that useful in practice.  It would
>>> definitely clutter the code and increase the chances that new opcodes
>>> would not be properly audited when they are merged.
>>
>> I'm curious, why it's enabled by many distros by default? Are there
>> use cases they use?
> 
> We've already talked about certain users and environments where audit
> is an important requirement, e.g. public sector, health care,
> financial institutions, etc.; without audit Linux wouldn't be an
> option for these users, at least not without heavy modification,
> out-of-tree/ISV patches, etc.  I currently don't have any direct ties
> to any distros, "Enterprise" or otherwise, but in the past it has been
> my experience that distros much prefer to have a single kernel build
> to address the needs of all their users.  In the few cases I have seen
> where a second kernel build is supported it is usually for hardware
> enablement.  I'm sure there are other cases too, I just haven't seen
> them personally; the big distros definitely seem to have a strong
> desire to limit the number of supported kernel configs/builds.
> 
>> Tempting to add AUDIT_IOURING=default N, but won't work I guess
> 
> One of the nice things about audit is that it can give you a history
> of what a user did on a system, which is very important for a number
> of use cases.  If we selectively disable audit for certain subsystems
> we create a blind spot in the audit log, and in the case of io_uring
> this can be a very serious blind spot.  I fear that if we can't come
> to some agreement here we will need to make io_uring and audit
> mutually exclusive at build time which would be awful; forcing many
> distros to either make a hard choice or carry out-of-tree patches.

I'm wondering why it's not enough to just let the native auditing happen.

E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes
already go via security_socket_{recvmsg,sendmsg,connect}()

IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls
and should also trigger audit_inode() and security_file_open().

So why is there anything special needed for io_uring (now that the native worker threads are used)?

Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall
would do? If so, I think that should just be fixed...

Additional LSM-based restrictions could be hooked into the io_check_restriction() path
and set up at io_uring_setup() or early io_uring_register() time.

What do you think?

metze
Richard Guy Briggs May 26, 2021, 3:49 p.m. UTC | #13
On 2021-05-26 17:17, Stefan Metzmacher wrote:
> 
> Am 26.05.21 um 16:38 schrieb Paul Moore:
> > On Wed, May 26, 2021 at 6:19 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >> On 5/26/21 3:04 AM, Paul Moore wrote:
> >>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>>>> That said, audit is not for everyone, and we have build time and
> >>>>> runtime options to help make life easier.  Beyond simply disabling
> >>>>> audit at compile time a number of Linux distributions effectively
> >>>>> shortcut audit at runtime by adding a "never" rule to the audit
> >>>>> filter, for example:
> >>>>>
> >>>>>  % auditctl -a task,never
> >>>>
> >>>> As has been brought up, the issue we're facing is that distros have
> >>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >>>> of people doing custom kernels. My question would then be how much
> >>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>> If auditctl is turned off, what is the expectation in turns of overhead?
> >>>
> >>> I commented on that case in my last email to Pavel, but I'll try to go
> >>> over it again in a little more detail.
> >>>
> >>> As we discussed earlier in this thread, we can skip the req->opcode
> >>> check before both the _entry and _exit calls, so we are left with just
> >>> the bare audit calls in the io_uring code.  As the _entry and _exit
> >>> functions are small, I've copied them and their supporting functions
> >>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> >>> "task,never" case.
> >>>
> >>> +  static inline struct audit_context *audit_context(void)
> >>> +  {
> >>> +    return current->audit_context;
> >>> +  }
> >>>
> >>> +  static inline bool audit_dummy_context(void)
> >>> +  {
> >>> +    void *p = audit_context();
> >>> +    return !p || *(int *)p;
> >>> +  }
> >>>
> >>> +  static inline void audit_uring_entry(u8 op)
> >>> +  {
> >>> +    if (unlikely(audit_enabled && audit_context()))
> >>> +      __audit_uring_entry(op);
> >>> +  }
> >>
> >> I'd rather agree that it's my cycle-picking. The case I care about
> >> is CONFIG_AUDIT=y (because everybody enable it), and io_uring
> >> tracing _not_ enabled at runtime. If enabled let them suffer
> >> the overhead, it will probably dip down the performance
> >>
> >> So, for the case I care about it's two of
> >>
> >> if (unlikely(audit_enabled && current->audit_context))
> >>
> >> in the hot path. load-test-jump + current, so it will
> >> be around 7x2 instructions. We can throw away audit_enabled
> >> as you say systemd already enables it, that will give
> >> 4x2 instructions including 2 conditional jumps.
> > 
> > We've basically got it down to the equivalent of two
> > "current->audit_context != NULL" checks in the case where audit is
> > built into the kernel but disabled at runtime, e.g. CONFIG_AUDIT=y and
> > "task,never".  I'm at a loss for how we can lower the overhead any
> > further, but I'm open to suggestions.
> > 
> >> That's not great at all. And that's why I brought up
> >> the question about need of pre and post hooks and whether
> >> can be combined. Would be just 4 instructions and that is
> >> ok (ish).
> > 
> > As discussed previously in this thread that isn't really an option
> > from an audit perspective.
> > 
> >>> We would need to check with the current security requirements (there
> >>> are distro people on the linux-audit list that keep track of that
> >>> stuff), but looking at the opcodes right now my gut feeling is that
> >>> most of the opcodes would be considered "security relevant" so
> >>> selective auditing might not be that useful in practice.  It would
> >>> definitely clutter the code and increase the chances that new opcodes
> >>> would not be properly audited when they are merged.
> >>
> >> I'm curious, why it's enabled by many distros by default? Are there
> >> use cases they use?
> > 
> > We've already talked about certain users and environments where audit
> > is an important requirement, e.g. public sector, health care,
> > financial institutions, etc.; without audit Linux wouldn't be an
> > option for these users, at least not without heavy modification,
> > out-of-tree/ISV patches, etc.  I currently don't have any direct ties
> > to any distros, "Enterprise" or otherwise, but in the past it has been
> > my experience that distros much prefer to have a single kernel build
> > to address the needs of all their users.  In the few cases I have seen
> > where a second kernel build is supported it is usually for hardware
> > enablement.  I'm sure there are other cases too, I just haven't seen
> > them personally; the big distros definitely seem to have a strong
> > desire to limit the number of supported kernel configs/builds.
> > 
> >> Tempting to add AUDIT_IOURING=default N, but won't work I guess
> > 
> > One of the nice things about audit is that it can give you a history
> > of what a user did on a system, which is very important for a number
> > of use cases.  If we selectively disable audit for certain subsystems
> > we create a blind spot in the audit log, and in the case of io_uring
> > this can be a very serious blind spot.  I fear that if we can't come
> > to some agreement here we will need to make io_uring and audit
> > mutually exclusive at build time which would be awful; forcing many
> > distros to either make a hard choice or carry out-of-tree patches.
> 
> I'm wondering why it's not enough to have the native auditing just to happen.

The audit context needs to be set up for each event.  This happens in
audit_syscall_entry() and audit_syscall_exit().

> E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes
> already go via security_socket_{recvmsg,sendmsg,connect}()
> 
> IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls
> and should also trigger audit_inode() and security_file_open().

These are extra hooks to grab operation-specific (syscall) parameters.

> So why is there anything special needed for io_uring (now that the native worker threads are used)?

Because the syscall has been bypassed by a memory-mapped work queue.

> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall
> would do? If so, I think that should just be fixed...

This is by design to speed it up.  This is what Paul's io_uring entry and
exit hooks do.

> Additional LSM based restrictions could be hooked into the io_check_restriction() path
> and setup at io_uring_setup() or early io_uring_register() time.
> 
> What do you think?
> 
> metze

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Victor Stewart May 26, 2021, 3:49 p.m. UTC | #14
> I'm wondering why it's not enough to have the native auditing just to happen.
>
> E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes
> already go via security_socket_{recvmsg,sendmsg,connect}()
>
> IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls
> and should also trigger audit_inode() and security_file_open().
>
> So why is there anything special needed for io_uring (now that the native worker threads are used)?
>
> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall
> would do? If so, I think that should just be fixed...

Stefan's points crossed my mind as well.

But assuming io_uring buy-in is required, from a design perspective,
rather than inserting these audit conditionals in the hot path,
wouldn't a layering model work better?  I.e., enabling auditing changes
the function entry point into io_uring and passes operations through an
auditing layer, then back to the main entry point.  Then there is no
cost to audit-disabled code, and you just force audit to pay whatever
double-processing cost that entails.

V
Casey Schaufler May 26, 2021, 4:38 p.m. UTC | #15
On 5/26/2021 8:49 AM, Victor Stewart wrote:
>> I'm wondering why it's not enough to have the native auditing just to happen.
>>
>> E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes
>> already go via security_socket_{recvmsg,sendmsg,connect}()
>>
>> IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls
>> and should also trigger audit_inode() and security_file_open().
>>
>> So why is there anything special needed for io_uring (now that the native worker threads are used)?
>>
>> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall
>> would do? If so, I think that should just be fixed...
> stefan's points crossed my mind as well.
>
> but assuming iouring buy-in is required, from a design perspective,
> rather than inserting these audit conditionals in the hotpath,
> wouldn't a layering model work better?
> aka enabling auditing changes the function entry point into io_uring
> and passes operations through an auditing layer, then back to the main
> entry point. then there is no
> cost to audit disabled code, and you just force audit to pay whatever
> double processing cost that entails.

There seems to be an assumption that the audit code isn't concerned
about processing cost. This is complete nonsense. While an audit system
has to be accurate and complete (hence encompassing io_uring) it also
has to be sufficiently performant for the system to be useful when it
is enabled. It would have been real handy had the audit and LSM
requirements been designed into io_uring. The performance implications
could have been addressed up front. Instead it all has to be retrofit.
io_uring, Audit and LSM need to perform well and the best way to ensure
that the combination works is cooperation between the developers. Any
response that includes the word "just" is unlikely to be helpful.


>
> V
>
> --
> Linux-audit mailing list
> Linux-audit@redhat.com
> https://listman.redhat.com/mailman/listinfo/linux-audit
>
Jens Axboe May 26, 2021, 5:15 p.m. UTC | #16
On 5/25/21 8:04 PM, Paul Moore wrote:
> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>> That said, audit is not for everyone, and we have build time and
>>> runtime options to help make life easier.  Beyond simply disabling
>>> audit at compile time a number of Linux distributions effectively
>>> shortcut audit at runtime by adding a "never" rule to the audit
>>> filter, for example:
>>>
>>>  % auditctl -a task,never
>>
>> As has been brought up, the issue we're facing is that distros have
>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>> of people doing custom kernels. My question would then be how much
>> overhead the above will add, considering it's an entry/exit call per op.
>> If auditctl is turned off, what is the expectation in turns of overhead?
> 
> I commented on that case in my last email to Pavel, but I'll try to go
> over it again in a little more detail.
> 
> As we discussed earlier in this thread, we can skip the req->opcode
> check before both the _entry and _exit calls, so we are left with just
> the bare audit calls in the io_uring code.  As the _entry and _exit
> functions are small, I've copied them and their supporting functions
> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> "task,never" case.
> 
> +  static inline struct audit_context *audit_context(void)
> +  {
> +    return current->audit_context;
> +  }
> 
> +  static inline bool audit_dummy_context(void)
> +  {
> +    void *p = audit_context();
> +    return !p || *(int *)p;
> +  }
> 
> +  static inline void audit_uring_entry(u8 op)
> +  {
> +    if (unlikely(audit_enabled && audit_context()))
> +      __audit_uring_entry(op);
> +  }
> 
> We have one if statement where the conditional checks on two
> individual conditions.  The first (audit_enabled) is simply a check to
> see if anyone has "turned on" auditing at runtime; historically this
> worked rather well, and still does in a number of places, but ever
> since systemd has taken to forcing audit on regardless of the admin's
> audit configuration it is less useful.  The second (audit_context())
> is a check to see if an audit_context has been allocated for the
> current task.  In the case of "task,never" current->audit_context will
> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
> will never be called.
> 
> Worst case here is checking the value of audit_enabled and
> current->audit_context.  Depending on which you think is more likely
> we can change the order of the check so that the
> current->audit_context check is first if you feel that is more likely
> to be NULL than audit_enabled is to be false (it may be that way now).
> 
> +  static inline void audit_uring_exit(int success, long code)
> +  {
> +    if (unlikely(!audit_dummy_context()))
> +      __audit_uring_exit(success, code);
> +  }
> 
> The exit call is very similar to the entry call, but in the
> "task,never" case it is very simple as the first check to be performed
> is the current->audit_context check which we know to be NULL.  The
> __audit_uring_exit() slowpath will never be called.

I actually ran some numbers this morning. The test base is 5.13+, and
CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y are set for both the baseline
test and the test with this series applied. I used your git branch as of
this morning.

The test case is my usual peak perf test, which is random reads at
QD=128 and using polled IO. It's a single core test, not threaded. I ran
two different tests - one was having a thread just do the IO, the other
is using SQPOLL to do the IO for us. The device is capable of more
IOPS than a single core can deliver, so we're CPU limited in this test.
Hence it's a good test case as it does actual work, and shows software
overhead quite nicely. Runs are very stable (less than 0.5% difference
between runs on the same base), yet I did average 4 runs.

Kernel		SQPOLL		IOPS		Perf diff
---------------------------------------------------------
5.13		0		3029872		0.0%
5.13		1		3031056		0.0%
5.13 + audit	0		2894160		-4.5%
5.13 + audit	1		2886168		-4.8%

That's an immediate drop in perf of almost 5%. Looking at a quick
profile of it (nothing fancy, just checking for 'audit' in the profile)
shows this:

+    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
+    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
     0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
     0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit

Note that this is with _no_ rules!

>> aio never had any audit logging as far as I can tell. I think it'd make
>> a lot more sense to selectively enable audit logging only for opcodes
>> that we care about. File open/create/unlink/mkdir etc, that kind of
>> thing. File level operations that people would care about logging. Would
>> they care about logging a buffer registration or a polled read from a
>> device/file? I highly doubt it, and we don't do that for alternative
>> methods either. Doesn't really make sense for a lot of the other
>> operations, imho.
> 
> We would need to check with the current security requirements (there
> are distro people on the linux-audit list that keep track of that
> stuff), but looking at the opcodes right now my gut feeling is that
> most of the opcodes would be considered "security relevant" so
> selective auditing might not be that useful in practice.  It would
> definitely clutter the code and increase the chances that new opcodes
> would not be properly audited when they are merged.

We don't audit read/write from aio, as mentioned. In the past two
decades, I take it that hasn't been a concern? I agree that some opcodes
should _definitely_ be audited. Things like opening a file, closing a
file, removing/creating a file, mount, etc. But normal read/write, I
think that's just utter noise and not useful at all. Auditing on a
per-opcode basis is trivial; io_uring already has provisions for
flagging opcode requirements, and this would just be another one.
Jens Axboe May 26, 2021, 5:22 p.m. UTC | #17
On 5/26/21 9:49 AM, Richard Guy Briggs wrote:
>> So why is there anything special needed for io_uring (now that the
>> native worker threads are used)?
> 
> Because syscall has been bypassed by a memory-mapped work queue.

I don't follow this one at all; that's just the delivery mechanism if
you choose to use SQPOLL. If you do, then a thread sibling of the
original task does the actual system call. There's no magic involved
there, and the tasks are related.

So care to expand on that?

>> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall
>> would do? If so, I think that should just be fixed...
> 
> This is by design to speed it up.  This is what Paul's iouring entry and
> exit hooks do.

As far as I can tell, we're doing double logging at that point if the
syscall helper does the audit as well. We'll get something logging the
io_uring opcode (e.g. IORING_OP_OPENAT2), then log again when we call
the fs helper. That's _assuming_ we always hit the logging part when we
call into the vfs, but that's something that can be made true and kept
an eye on for future additions.

Why is the double logging useful? It only tells you that the invocation
was via io_uring as the delivery mechanism rather than the usual system
call, but the effect is the same - the file is opened, for example.

I feel like I'm missing something here, or the other side is. Or both!
Jens Axboe May 26, 2021, 5:31 p.m. UTC | #18
On 5/26/21 11:15 AM, Jens Axboe wrote:
> On 5/25/21 8:04 PM, Paul Moore wrote:
>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>> That said, audit is not for everyone, and we have build time and
>>>> runtime options to help make life easier.  Beyond simply disabling
>>>> audit at compile time a number of Linux distributions effectively
>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>> filter, for example:
>>>>
>>>>  % auditctl -a task,never
>>>
>>> As has been brought up, the issue we're facing is that distros have
>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>> of people doing custom kernels. My question would then be how much
>>> overhead the above will add, considering it's an entry/exit call per op.
>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>
>> I commented on that case in my last email to Pavel, but I'll try to go
>> over it again in a little more detail.
>>
>> As we discussed earlier in this thread, we can skip the req->opcode
>> check before both the _entry and _exit calls, so we are left with just
>> the bare audit calls in the io_uring code.  As the _entry and _exit
>> functions are small, I've copied them and their supporting functions
>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>> "task,never" case.
>>
>> +  static inline struct audit_context *audit_context(void)
>> +  {
>> +    return current->audit_context;
>> +  }
>>
>> +  static inline bool audit_dummy_context(void)
>> +  {
>> +    void *p = audit_context();
>> +    return !p || *(int *)p;
>> +  }
>>
>> +  static inline void audit_uring_entry(u8 op)
>> +  {
>> +    if (unlikely(audit_enabled && audit_context()))
>> +      __audit_uring_entry(op);
>> +  }
>>
>> We have one if statement where the conditional checks on two
>> individual conditions.  The first (audit_enabled) is simply a check to
>> see if anyone has "turned on" auditing at runtime; historically this
>> worked rather well, and still does in a number of places, but ever
>> since systemd has taken to forcing audit on regardless of the admin's
>> audit configuration it is less useful.  The second (audit_context())
>> is a check to see if an audit_context has been allocated for the
>> current task.  In the case of "task,never" current->audit_context will
>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
>> will never be called.
>>
>> Worst case here is checking the value of audit_enabled and
>> current->audit_context.  Depending on which you think is more likely
>> we can change the order of the check so that the
>> current->audit_context check is first if you feel that is more likely
>> to be NULL than audit_enabled is to be false (it may be that way now).
>>
>> +  static inline void audit_uring_exit(int success, long code)
>> +  {
>> +    if (unlikely(!audit_dummy_context()))
>> +      __audit_uring_exit(success, code);
>> +  }
>>
>> The exit call is very similar to the entry call, but in the
>> "task,never" case it is very simple as the first check to be performed
>> is the current->audit_context check which we know to be NULL.  The
>> __audit_uring_exit() slowpath will never be called.
> 
> I actually ran some numbers this morning. The test base is 5.13+, and
> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> test and the test with this series applied. I used your git branch as of
> this morning.
> 
> The test case is my usual peak perf test, which is random reads at
> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> two different tests - one was having a thread just do the IO, the other
> is using SQPOLL to do the IO for us. The device is capable than more
> IOPS than a single core can deliver, so we're CPU limited in this test.
> Hence it's a good test case as it does actual work, and shows software
> overhead quite nicely. Runs are very stable (less than 0.5% difference
> between runs on the same base), yet I did average 4 runs.
> 
> Kernel		SQPOLL		IOPS		Perf diff
> ---------------------------------------------------------
> 5.13		0		3029872		0.0%
> 5.13		1		3031056		0.0%
> 5.13 + audit	0		2894160		-4.5%
> 5.13 + audit	1		2886168		-4.8%
> 
> That's an immediate drop in perf of almost 5%. Looking at a quick
> profile of it (nothing fancy, just checking for 'audit' in the profile)
> shows this:
> 
> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> 
> Note that this is with _no_ rules!

io_uring also supports a NOP command, which basically just measures
reqs/sec through the interface. Ran that as well:

Kernel		SQPOLL		IOPS		Perf diff
---------------------------------------------------------
5.13		0		31.05M		0.0%
5.13 + audit	0		25.31M		-18.5%

and profile for the latter includes:

+    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
+    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
     0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
     0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
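
For anyone who wants to poke at this themselves: the NOP test is
conceptually just a tight submit/complete loop through the ring. Below
is a minimal liburing sketch of that kind of loop; it is an
illustration only (it submits one request at a time rather than
batching at QD=128) and is not the actual test tool.

  /* build with: gcc nop.c -o nop -luring (assumes liburing is installed) */
  #include <stdio.h>
  #include <liburing.h>

  int main(void)
  {
          struct io_uring ring;
          struct io_uring_cqe *cqe;
          unsigned long i, nr = 1000000;

          if (io_uring_queue_init(128, &ring, 0) < 0)
                  return 1;

          for (i = 0; i < nr; i++) {
                  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

                  if (!sqe)
                          break;
                  io_uring_prep_nop(sqe);         /* IORING_OP_NOP */
                  io_uring_submit(&ring);
                  if (io_uring_wait_cqe(&ring, &cqe))
                          break;
                  io_uring_cqe_seen(&ring, cqe);  /* retire the completion */
          }
          io_uring_queue_exit(&ring);
          printf("completed %lu nops\n", i);
          return 0;
  }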
Jens Axboe May 26, 2021, 5:54 p.m. UTC | #19
On 5/26/21 11:31 AM, Jens Axboe wrote:
> On 5/26/21 11:15 AM, Jens Axboe wrote:
>> On 5/25/21 8:04 PM, Paul Moore wrote:
>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>>> That said, audit is not for everyone, and we have build time and
>>>>> runtime options to help make life easier.  Beyond simply disabling
>>>>> audit at compile time a number of Linux distributions effectively
>>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>>> filter, for example:
>>>>>
>>>>>  % auditctl -a task,never
>>>>
>>>> As has been brought up, the issue we're facing is that distros have
>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>>> of people doing custom kernels. My question would then be how much
>>>> overhead the above will add, considering it's an entry/exit call per op.
>>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>>
>>> I commented on that case in my last email to Pavel, but I'll try to go
>>> over it again in a little more detail.
>>>
>>> As we discussed earlier in this thread, we can skip the req->opcode
>>> check before both the _entry and _exit calls, so we are left with just
>>> the bare audit calls in the io_uring code.  As the _entry and _exit
>>> functions are small, I've copied them and their supporting functions
>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>>> "task,never" case.
>>>
>>> +  static inline struct audit_context *audit_context(void)
>>> +  {
>>> +    return current->audit_context;
>>> +  }
>>>
>>> +  static inline bool audit_dummy_context(void)
>>> +  {
>>> +    void *p = audit_context();
>>> +    return !p || *(int *)p;
>>> +  }
>>>
>>> +  static inline void audit_uring_entry(u8 op)
>>> +  {
>>> +    if (unlikely(audit_enabled && audit_context()))
>>> +      __audit_uring_entry(op);
>>> +  }
>>>
>>> We have one if statement where the conditional checks on two
>>> individual conditions.  The first (audit_enabled) is simply a check to
>>> see if anyone has "turned on" auditing at runtime; historically this
>>> worked rather well, and still does in a number of places, but ever
>>> since systemd has taken to forcing audit on regardless of the admin's
>>> audit configuration it is less useful.  The second (audit_context())
>>> is a check to see if an audit_context has been allocated for the
>>> current task.  In the case of "task,never" current->audit_context will
>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
>>> will never be called.
>>>
>>> Worst case here is checking the value of audit_enabled and
>>> current->audit_context.  Depending on which you think is more likely
>>> we can change the order of the check so that the
>>> current->audit_context check is first if you feel that is more likely
>>> to be NULL than audit_enabled is to be false (it may be that way now).
>>>
>>> +  static inline void audit_uring_exit(int success, long code)
>>> +  {
>>> +    if (unlikely(!audit_dummy_context()))
>>> +      __audit_uring_exit(success, code);
>>> +  }
>>>
>>> The exit call is very similar to the entry call, but in the
>>> "task,never" case it is very simple as the first check to be performed
>>> is the current->audit_context check which we know to be NULL.  The
>>> __audit_uring_exit() slowpath will never be called.
>>
>> I actually ran some numbers this morning. The test base is 5.13+, and
>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
>> test and the test with this series applied. I used your git branch as of
>> this morning.
>>
>> The test case is my usual peak perf test, which is random reads at
>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
>> two different tests - one was having a thread just do the IO, the other
>> is using SQPOLL to do the IO for us. The device is capable than more
>> IOPS than a single core can deliver, so we're CPU limited in this test.
>> Hence it's a good test case as it does actual work, and shows software
>> overhead quite nicely. Runs are very stable (less than 0.5% difference
>> between runs on the same base), yet I did average 4 runs.
>>
>> Kernel		SQPOLL		IOPS		Perf diff
>> ---------------------------------------------------------
>> 5.13		0		3029872		0.0%
>> 5.13		1		3031056		0.0%
>> 5.13 + audit	0		2894160		-4.5%
>> 5.13 + audit	1		2886168		-4.8%
>>
>> That's an immediate drop in perf of almost 5%. Looking at a quick
>> profile of it (nothing fancy, just checking for 'audit' in the profile)
>> shows this:
>>
>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>>
>> Note that this is with _no_ rules!
> 
> io_uring also supports a NOP command, which basically just measures
> reqs/sec through the interface. Ran that as well:
> 
> Kernel		SQPOLL		IOPS		Perf diff
> ---------------------------------------------------------
> 5.13		0		31.05M		0.0%
> 5.13 + audit	0		25.31M		-18.5%
> 
> and profile for the latter includes:
> 
> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit

As Pavel correctly pointed out, it looks like auditing is enabled. And
indeed it was! Hence the above numbers are without having turned off
auditing. Running the NOPs after having turned off audit, we get 30.6M
IOPS, which is down about 1.5% from the baseline. The results for the
polled random read test above did _not_ change from this; they are still
down the same amount.

Note, and I should have included this in the first email, this is not
any kind of argument for or against audit logging. It's purely meant to
be a set of numbers that show how the current series impacts
performance.
Jens Axboe May 26, 2021, 6:01 p.m. UTC | #20
On 5/26/21 11:54 AM, Jens Axboe wrote:
> On 5/26/21 11:31 AM, Jens Axboe wrote:
>> On 5/26/21 11:15 AM, Jens Axboe wrote:
>>> On 5/25/21 8:04 PM, Paul Moore wrote:
>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>>>> That said, audit is not for everyone, and we have build time and
>>>>>> runtime options to help make life easier.  Beyond simply disabling
>>>>>> audit at compile time a number of Linux distributions effectively
>>>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>>>> filter, for example:
>>>>>>
>>>>>>  % auditctl -a task,never
>>>>>
>>>>> As has been brought up, the issue we're facing is that distros have
>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>>>> of people doing custom kernels. My question would then be how much
>>>>> overhead the above will add, considering it's an entry/exit call per op.
>>>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>>>
>>>> I commented on that case in my last email to Pavel, but I'll try to go
>>>> over it again in a little more detail.
>>>>
>>>> As we discussed earlier in this thread, we can skip the req->opcode
>>>> check before both the _entry and _exit calls, so we are left with just
>>>> the bare audit calls in the io_uring code.  As the _entry and _exit
>>>> functions are small, I've copied them and their supporting functions
>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>>>> "task,never" case.
>>>>
>>>> +  static inline struct audit_context *audit_context(void)
>>>> +  {
>>>> +    return current->audit_context;
>>>> +  }
>>>>
>>>> +  static inline bool audit_dummy_context(void)
>>>> +  {
>>>> +    void *p = audit_context();
>>>> +    return !p || *(int *)p;
>>>> +  }
>>>>
>>>> +  static inline void audit_uring_entry(u8 op)
>>>> +  {
>>>> +    if (unlikely(audit_enabled && audit_context()))
>>>> +      __audit_uring_entry(op);
>>>> +  }
>>>>
>>>> We have one if statement where the conditional checks on two
>>>> individual conditions.  The first (audit_enabled) is simply a check to
>>>> see if anyone has "turned on" auditing at runtime; historically this
>>>> worked rather well, and still does in a number of places, but ever
>>>> since systemd has taken to forcing audit on regardless of the admin's
>>>> audit configuration it is less useful.  The second (audit_context())
>>>> is a check to see if an audit_context has been allocated for the
>>>> current task.  In the case of "task,never" current->audit_context will
>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
>>>> will never be called.
>>>>
>>>> Worst case here is checking the value of audit_enabled and
>>>> current->audit_context.  Depending on which you think is more likely
>>>> we can change the order of the check so that the
>>>> current->audit_context check is first if you feel that is more likely
>>>> to be NULL than audit_enabled is to be false (it may be that way now).
>>>>
>>>> +  static inline void audit_uring_exit(int success, long code)
>>>> +  {
>>>> +    if (unlikely(!audit_dummy_context()))
>>>> +      __audit_uring_exit(success, code);
>>>> +  }
>>>>
>>>> The exit call is very similar to the entry call, but in the
>>>> "task,never" case it is very simple as the first check to be performed
>>>> is the current->audit_context check which we know to be NULL.  The
>>>> __audit_uring_exit() slowpath will never be called.
>>>
>>> I actually ran some numbers this morning. The test base is 5.13+, and
>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
>>> test and the test with this series applied. I used your git branch as of
>>> this morning.
>>>
>>> The test case is my usual peak perf test, which is random reads at
>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
>>> two different tests - one was having a thread just do the IO, the other
>>> is using SQPOLL to do the IO for us. The device is capable than more
>>> IOPS than a single core can deliver, so we're CPU limited in this test.
>>> Hence it's a good test case as it does actual work, and shows software
>>> overhead quite nicely. Runs are very stable (less than 0.5% difference
>>> between runs on the same base), yet I did average 4 runs.
>>>
>>> Kernel		SQPOLL		IOPS		Perf diff
>>> ---------------------------------------------------------
>>> 5.13		0		3029872		0.0%
>>> 5.13		1		3031056		0.0%
>>> 5.13 + audit	0		2894160		-4.5%
>>> 5.13 + audit	1		2886168		-4.8%
>>>
>>> That's an immediate drop in perf of almost 5%. Looking at a quick
>>> profile of it (nothing fancy, just checking for 'audit' in the profile)
>>> shows this:
>>>
>>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>>>
>>> Note that this is with _no_ rules!
>>
>> io_uring also supports a NOP command, which basically just measures
>> reqs/sec through the interface. Ran that as well:
>>
>> Kernel		SQPOLL		IOPS		Perf diff
>> ---------------------------------------------------------
>> 5.13		0		31.05M		0.0%
>> 5.13 + audit	0		25.31M		-18.5%
>>
>> and profile for the latter includes:
>>
>> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> 
> As Pavel correctly pointed it, looks like auditing is enabled. And
> indeed it was! Hence the above numbers is without having turned off
> auditing. Running the NOPs after having turned off audit, we get 30.6M
> IOPS, which is down about 1.5% from the baseline. The results for the
> polled random read test above did _not_ change from this, they are still
> down the same amount.
> 
> Note, and I should have included this in the first email, this is not
> any kind of argument for or against audit logging. It's purely meant to
> be a set of numbers that show how the current series impacts
> performance.

And finally, just checking if we make it optional per opcode if we see
any real impact, and the answer is no. Using the below patch which
effectively bypasses audit calls unless the opcode has flagged the need
to do so, I cannot measure any difference in perf (as expected).

To turn this into something useful, my suggestion as a viable path
forward would be:

1) Use something like the below patch and flag request types that we
   want to do audit logging for.

2) As Pavel suggested, eliminate the need for having both an entry and
   an exit hook, turning it into just one. That effectively cuts the number of
   checks and calls in half.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index aa065808ddcf..2c7c913b786b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -885,6 +885,8 @@ struct io_op_def {
 	unsigned		needs_async_setup : 1;
 	/* should block plug */
 	unsigned		plug : 1;
+	/* should audit */
+	unsigned		audit : 1;
 	/* size of async data needed, if any */
 	unsigned short		async_size;
 };
@@ -6122,7 +6124,7 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	if (req->work.creds && req->work.creds != current_cred())
 		creds = override_creds(req->work.creds);
 
-	if (req->opcode < IORING_OP_LAST)
+	if (io_op_defs[req->opcode].audit)
 		audit_uring_entry(req->opcode);
 
 	switch (req->opcode) {
@@ -6231,7 +6233,7 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 		break;
 	}
 
-	if (req->opcode < IORING_OP_LAST)
+	if (io_op_defs[req->opcode].audit)
 		audit_uring_exit(!ret, ret);
 
 	if (creds)
Paul Moore May 26, 2021, 6:38 p.m. UTC | #21
On Wed, May 26, 2021 at 1:54 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 5/26/21 11:31 AM, Jens Axboe wrote:
> > On 5/26/21 11:15 AM, Jens Axboe wrote:
> >> On 5/25/21 8:04 PM, Paul Moore wrote:
> >>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>>>> That said, audit is not for everyone, and we have build time and
> >>>>> runtime options to help make life easier.  Beyond simply disabling
> >>>>> audit at compile time a number of Linux distributions effectively
> >>>>> shortcut audit at runtime by adding a "never" rule to the audit
> >>>>> filter, for example:
> >>>>>
> >>>>>  % auditctl -a task,never
> >>>>
> >>>> As has been brought up, the issue we're facing is that distros have
> >>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >>>> of people doing custom kernels. My question would then be how much
> >>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>> If auditctl is turned off, what is the expectation in turns of overhead?
> >>>
> >>> I commented on that case in my last email to Pavel, but I'll try to go
> >>> over it again in a little more detail.
> >>>
> >>> As we discussed earlier in this thread, we can skip the req->opcode
> >>> check before both the _entry and _exit calls, so we are left with just
> >>> the bare audit calls in the io_uring code.  As the _entry and _exit
> >>> functions are small, I've copied them and their supporting functions
> >>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> >>> "task,never" case.
> >>>
> >>> +  static inline struct audit_context *audit_context(void)
> >>> +  {
> >>> +    return current->audit_context;
> >>> +  }
> >>>
> >>> +  static inline bool audit_dummy_context(void)
> >>> +  {
> >>> +    void *p = audit_context();
> >>> +    return !p || *(int *)p;
> >>> +  }
> >>>
> >>> +  static inline void audit_uring_entry(u8 op)
> >>> +  {
> >>> +    if (unlikely(audit_enabled && audit_context()))
> >>> +      __audit_uring_entry(op);
> >>> +  }
> >>>
> >>> We have one if statement where the conditional checks on two
> >>> individual conditions.  The first (audit_enabled) is simply a check to
> >>> see if anyone has "turned on" auditing at runtime; historically this
> >>> worked rather well, and still does in a number of places, but ever
> >>> since systemd has taken to forcing audit on regardless of the admin's
> >>> audit configuration it is less useful.  The second (audit_context())
> >>> is a check to see if an audit_context has been allocated for the
> >>> current task.  In the case of "task,never" current->audit_context will
> >>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
> >>> will never be called.
> >>>
> >>> Worst case here is checking the value of audit_enabled and
> >>> current->audit_context.  Depending on which you think is more likely
> >>> we can change the order of the check so that the
> >>> current->audit_context check is first if you feel that is more likely
> >>> to be NULL than audit_enabled is to be false (it may be that way now).
> >>>
> >>> +  static inline void audit_uring_exit(int success, long code)
> >>> +  {
> >>> +    if (unlikely(!audit_dummy_context()))
> >>> +      __audit_uring_exit(success, code);
> >>> +  }
> >>>
> >>> The exit call is very similar to the entry call, but in the
> >>> "task,never" case it is very simple as the first check to be performed
> >>> is the current->audit_context check which we know to be NULL.  The
> >>> __audit_uring_exit() slowpath will never be called.
> >>
> >> I actually ran some numbers this morning. The test base is 5.13+, and
> >> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> >> test and the test with this series applied. I used your git branch as of
> >> this morning.
> >>
> >> The test case is my usual peak perf test, which is random reads at
> >> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> >> two different tests - one was having a thread just do the IO, the other
> >> is using SQPOLL to do the IO for us. The device is capable than more
> >> IOPS than a single core can deliver, so we're CPU limited in this test.
> >> Hence it's a good test case as it does actual work, and shows software
> >> overhead quite nicely. Runs are very stable (less than 0.5% difference
> >> between runs on the same base), yet I did average 4 runs.
> >>
> >> Kernel               SQPOLL          IOPS            Perf diff
> >> ---------------------------------------------------------
> >> 5.13         0               3029872         0.0%
> >> 5.13         1               3031056         0.0%
> >> 5.13 + audit 0               2894160         -4.5%
> >> 5.13 + audit 1               2886168         -4.8%
> >>
> >> That's an immediate drop in perf of almost 5%. Looking at a quick
> >> profile of it (nothing fancy, just checking for 'audit' in the profile)
> >> shows this:
> >>
> >> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> >> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> >>
> >> Note that this is with _no_ rules!
> >
> > io_uring also supports a NOP command, which basically just measures
> > reqs/sec through the interface. Ran that as well:
> >
> > Kernel                SQPOLL          IOPS            Perf diff
> > ---------------------------------------------------------
> > 5.13          0               31.05M          0.0%
> > 5.13 + audit  0               25.31M          -18.5%
> >
> > and profile for the latter includes:
> >
> > +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> > +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>
> As Pavel correctly pointed it, looks like auditing is enabled. And
> indeed it was! Hence the above numbers is without having turned off
> auditing. Running the NOPs after having turned off audit, we get 30.6M
> IOPS, which is down about 1.5% from the baseline. The results for the
> polled random read test above did _not_ change from this, they are still
> down the same amount.
>
> Note, and I should have included this in the first email, this is not
> any kind of argument for or against audit logging. It's purely meant to
> be a set of numbers that show how the current series impacts
> performance.

Thanks for running some tests Jens, unfortunately the git tree hadn't
been updated to reflect the improved audit_uring_entry() and
audit_uring_exit() functions as we were still discussing things and I
thought there still might be some changes.  I just now updated the
branch with the latest code so if you have the cycles (ha!) to run
another perf test I think those numbers would be more interesting.
For example, I don't believe you should see __audit_uring_entry() or
__audit_uring_exit() at all if you have the audit "task,never" rule
loaded; you will see audit_uring_entry() and audit_uring_exit() but as
we discussed previously those should just be a single
"current->audit_context != NULL" check.

Also, I want to pull back a bit as I think there is confusion about
how audit works and why these changes are necessary.  As everyone
likely knows, there are audit calls sprinkled throughout the kernel
code, e.g. audit_log_format() and similar.  In addition to these calls
that most are aware of, there are setup/teardown audit calls that run
at task creation and destruction as well as at syscall entry and exit.
It is these lesser known calls that are responsible for the filtering,
setup, and general management of the audit context state in the
system; they also handle generation of some audit records such as
SYSCALL, PATH, etc. based on information that is recorded by audit
calls inserted into other places in the kernel.  The
audit_alloc_kernel() and audit_free() calls we are adding to the
io_uring/io-wq code are intended to do similar things to the existing
audit task creation/destruction hooks, but for the io_uring/io-wq
kernel threads.  Similarly the audit_uring_entry() and
audit_uring_exit() calls are intended to act as replacements for the
syscall entry and exit code.  If we want the existing audit calls in
the kernel to work properly, we need these setup/teardown functions.

Hopefully that makes things a bit more clear regarding these calls in
the io_uring/io-wq code.
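
To put that in userspace terms, here is a minimal liburing example
(illustration only, not part of this patchset) of the kind of operation
the new hooks would record as an AUDIT_URINGOP event when auditing is
active; IORING_OP_OPENAT is used since an open is an obviously
security-relevant operation:

  /* build with: gcc openat.c -o openat -luring (assumes liburing is installed) */
  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <liburing.h>

  int main(void)
  {
          struct io_uring ring;
          struct io_uring_sqe *sqe;
          struct io_uring_cqe *cqe;

          if (io_uring_queue_init(8, &ring, 0) < 0)
                  return 1;

          sqe = io_uring_get_sqe(&ring);
          if (!sqe)
                  return 1;
          /* the open happens via io_issue_sqe(), not the openat() syscall */
          io_uring_prep_openat(sqe, AT_FDCWD, "/etc/hostname", O_RDONLY, 0);
          io_uring_submit(&ring);

          if (!io_uring_wait_cqe(&ring, &cqe)) {
                  printf("openat via io_uring returned %d\n", cqe->res);
                  if (cqe->res >= 0)
                          close(cqe->res);
                  io_uring_cqe_seen(&ring, cqe);
          }
          io_uring_queue_exit(&ring);
          return 0;
  }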

Another point I wanted to address is the "double audit" (!!!) that has
been mentioned recently in this thread.  Don't get too excited; this
isn't quite what you think it is, it is simply a side effect of how
io_uring can dispatch certain operations.  As I think the io_uring
folks already know, io_uring can process I/O ops in several different
contexts, one of which is after a syscall completes but before the
kernel returns to userspace.  In this particular case things can get
rather interesting from an audit perspective as we need to handle both
the syscall audit records *and* the io_uring operation records.  If
you look closer at the audit code in this patchset you'll see some of
the fun stuff we need to do to make sure this case is handled
correctly.  If you happen to see a code path that takes you through
audit_syscall_entry() + <syscall_stuff> + audit_uring_entry() +
<io_uring_stuff> + audit_uring_exit() + audit_syscall_exit() rest
assured that is correct :)

Of course there is the normal audit_uring_entry() and
audit_uring_exit() without the audit syscall hooks in other code
paths, e.g. SQPOLL, but those are less dramatic than the "OMG, double
audit!!!" ;)
Paul Moore May 26, 2021, 6:44 p.m. UTC | #22
On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 5/26/21 11:54 AM, Jens Axboe wrote:
> > On 5/26/21 11:31 AM, Jens Axboe wrote:
> >> On 5/26/21 11:15 AM, Jens Axboe wrote:
> >>> On 5/25/21 8:04 PM, Paul Moore wrote:
> >>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>>>>> That said, audit is not for everyone, and we have build time and
> >>>>>> runtime options to help make life easier.  Beyond simply disabling
> >>>>>> audit at compile time a number of Linux distributions effectively
> >>>>>> shortcut audit at runtime by adding a "never" rule to the audit
> >>>>>> filter, for example:
> >>>>>>
> >>>>>>  % auditctl -a task,never
> >>>>>
> >>>>> As has been brought up, the issue we're facing is that distros have
> >>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >>>>> of people doing custom kernels. My question would then be how much
> >>>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>>> If auditctl is turned off, what is the expectation in turns of overhead?
> >>>>
> >>>> I commented on that case in my last email to Pavel, but I'll try to go
> >>>> over it again in a little more detail.
> >>>>
> >>>> As we discussed earlier in this thread, we can skip the req->opcode
> >>>> check before both the _entry and _exit calls, so we are left with just
> >>>> the bare audit calls in the io_uring code.  As the _entry and _exit
> >>>> functions are small, I've copied them and their supporting functions
> >>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> >>>> "task,never" case.
> >>>>
> >>>> +  static inline struct audit_context *audit_context(void)
> >>>> +  {
> >>>> +    return current->audit_context;
> >>>> +  }
> >>>>
> >>>> +  static inline bool audit_dummy_context(void)
> >>>> +  {
> >>>> +    void *p = audit_context();
> >>>> +    return !p || *(int *)p;
> >>>> +  }
> >>>>
> >>>> +  static inline void audit_uring_entry(u8 op)
> >>>> +  {
> >>>> +    if (unlikely(audit_enabled && audit_context()))
> >>>> +      __audit_uring_entry(op);
> >>>> +  }
> >>>>
> >>>> We have one if statement where the conditional checks on two
> >>>> individual conditions.  The first (audit_enabled) is simply a check to
> >>>> see if anyone has "turned on" auditing at runtime; historically this
> >>>> worked rather well, and still does in a number of places, but ever
> >>>> since systemd has taken to forcing audit on regardless of the admin's
> >>>> audit configuration it is less useful.  The second (audit_context())
> >>>> is a check to see if an audit_context has been allocated for the
> >>>> current task.  In the case of "task,never" current->audit_context will
> >>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
> >>>> will never be called.
> >>>>
> >>>> Worst case here is checking the value of audit_enabled and
> >>>> current->audit_context.  Depending on which you think is more likely
> >>>> we can change the order of the check so that the
> >>>> current->audit_context check is first if you feel that is more likely
> >>>> to be NULL than audit_enabled is to be false (it may be that way now).
> >>>>
> >>>> +  static inline void audit_uring_exit(int success, long code)
> >>>> +  {
> >>>> +    if (unlikely(!audit_dummy_context()))
> >>>> +      __audit_uring_exit(success, code);
> >>>> +  }
> >>>>
> >>>> The exit call is very similar to the entry call, but in the
> >>>> "task,never" case it is very simple as the first check to be performed
> >>>> is the current->audit_context check which we know to be NULL.  The
> >>>> __audit_uring_exit() slowpath will never be called.
> >>>
> >>> I actually ran some numbers this morning. The test base is 5.13+, and
> >>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> >>> test and the test with this series applied. I used your git branch as of
> >>> this morning.
> >>>
> >>> The test case is my usual peak perf test, which is random reads at
> >>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> >>> two different tests - one was having a thread just do the IO, the other
> >>> is using SQPOLL to do the IO for us. The device is capable than more
> >>> IOPS than a single core can deliver, so we're CPU limited in this test.
> >>> Hence it's a good test case as it does actual work, and shows software
> >>> overhead quite nicely. Runs are very stable (less than 0.5% difference
> >>> between runs on the same base), yet I did average 4 runs.
> >>>
> >>> Kernel              SQPOLL          IOPS            Perf diff
> >>> ---------------------------------------------------------
> >>> 5.13                0               3029872         0.0%
> >>> 5.13                1               3031056         0.0%
> >>> 5.13 + audit        0               2894160         -4.5%
> >>> 5.13 + audit        1               2886168         -4.8%
> >>>
> >>> That's an immediate drop in perf of almost 5%. Looking at a quick
> >>> profile of it (nothing fancy, just checking for 'audit' in the profile)
> >>> shows this:
> >>>
> >>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> >>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> >>>
> >>> Note that this is with _no_ rules!
> >>
> >> io_uring also supports a NOP command, which basically just measures
> >> reqs/sec through the interface. Ran that as well:
> >>
> >> Kernel               SQPOLL          IOPS            Perf diff
> >> ---------------------------------------------------------
> >> 5.13         0               31.05M          0.0%
> >> 5.13 + audit 0               25.31M          -18.5%
> >>
> >> and profile for the latter includes:
> >>
> >> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> >> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> >
> > As Pavel correctly pointed it, looks like auditing is enabled. And
> > indeed it was! Hence the above numbers is without having turned off
> > auditing. Running the NOPs after having turned off audit, we get 30.6M
> > IOPS, which is down about 1.5% from the baseline. The results for the
> > polled random read test above did _not_ change from this, they are still
> > down the same amount.
> >
> > Note, and I should have included this in the first email, this is not
> > any kind of argument for or against audit logging. It's purely meant to
> > be a set of numbers that show how the current series impacts
> > performance.
>
> And finally, just checking if we make it optional per opcode if we see
> any real impact, and the answer is no. Using the below patch which
> effectively bypasses audit calls unless the opcode has flagged the need
> to do so, I cannot measure any difference in perf (as expected).
>
> To turn this into something useful, my suggestion as a viable path
> forward would be:
>
> 1) Use something like the below patch and flag request types that we
>    want to do audit logging for.
>
> 2) As Pavel suggested, eliminate the need for having both and entry/exit
>    hook, turning it into just one. That effectively cuts the number of
>    checks and calls in half.

I suspect the updated working-io_uring branch with HEAD at
1f25193a3f54 (updated a short time ago, see my last email in this
thread) will improve performance.  Also, as has been mentioned several
times now, for audit to work we need both the _entry and _exit call.
Pavel Begunkov May 26, 2021, 6:57 p.m. UTC | #23
On 5/26/21 7:44 PM, Paul Moore wrote:
> On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 5/26/21 11:54 AM, Jens Axboe wrote:
>>> On 5/26/21 11:31 AM, Jens Axboe wrote:
>>>> On 5/26/21 11:15 AM, Jens Axboe wrote:
>>>>> On 5/25/21 8:04 PM, Paul Moore wrote:
>>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>>>>>> That said, audit is not for everyone, and we have build time and
>>>>>>>> runtime options to help make life easier.  Beyond simply disabling
>>>>>>>> audit at compile time a number of Linux distributions effectively
>>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>>>>>> filter, for example:
>>>>>>>>
>>>>>>>>  % auditctl -a task,never
>>>>>>>
>>>>>>> As has been brought up, the issue we're facing is that distros have
>>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>>>>>> of people doing custom kernels. My question would then be how much
>>>>>>> overhead the above will add, considering it's an entry/exit call per op.
>>>>>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>>>>>
>>>>>> I commented on that case in my last email to Pavel, but I'll try to go
>>>>>> over it again in a little more detail.
>>>>>>
>>>>>> As we discussed earlier in this thread, we can skip the req->opcode
>>>>>> check before both the _entry and _exit calls, so we are left with just
>>>>>> the bare audit calls in the io_uring code.  As the _entry and _exit
>>>>>> functions are small, I've copied them and their supporting functions
>>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>>>>>> "task,never" case.
>>>>>>
>>>>>> +  static inline struct audit_context *audit_context(void)
>>>>>> +  {
>>>>>> +    return current->audit_context;
>>>>>> +  }
>>>>>>
>>>>>> +  static inline bool audit_dummy_context(void)
>>>>>> +  {
>>>>>> +    void *p = audit_context();
>>>>>> +    return !p || *(int *)p;
>>>>>> +  }
>>>>>>
>>>>>> +  static inline void audit_uring_entry(u8 op)
>>>>>> +  {
>>>>>> +    if (unlikely(audit_enabled && audit_context()))
>>>>>> +      __audit_uring_entry(op);
>>>>>> +  }
>>>>>>
>>>>>> We have one if statement where the conditional checks on two
>>>>>> individual conditions.  The first (audit_enabled) is simply a check to
>>>>>> see if anyone has "turned on" auditing at runtime; historically this
>>>>>> worked rather well, and still does in a number of places, but ever
>>>>>> since systemd has taken to forcing audit on regardless of the admin's
>>>>>> audit configuration it is less useful.  The second (audit_context())
>>>>>> is a check to see if an audit_context has been allocated for the
>>>>>> current task.  In the case of "task,never" current->audit_context will
>>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
>>>>>> will never be called.
>>>>>>
>>>>>> Worst case here is checking the value of audit_enabled and
>>>>>> current->audit_context.  Depending on which you think is more likely
>>>>>> we can change the order of the check so that the
>>>>>> current->audit_context check is first if you feel that is more likely
>>>>>> to be NULL than audit_enabled is to be false (it may be that way now).
>>>>>>
>>>>>> +  static inline void audit_uring_exit(int success, long code)
>>>>>> +  {
>>>>>> +    if (unlikely(!audit_dummy_context()))
>>>>>> +      __audit_uring_exit(success, code);
>>>>>> +  }
>>>>>>
>>>>>> The exit call is very similar to the entry call, but in the
>>>>>> "task,never" case it is very simple as the first check to be performed
>>>>>> is the current->audit_context check which we know to be NULL.  The
>>>>>> __audit_uring_exit() slowpath will never be called.
>>>>>
>>>>> I actually ran some numbers this morning. The test base is 5.13+, and
>>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
>>>>> test and the test with this series applied. I used your git branch as of
>>>>> this morning.
>>>>>
>>>>> The test case is my usual peak perf test, which is random reads at
>>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
>>>>> two different tests - one was having a thread just do the IO, the other
>>>>> is using SQPOLL to do the IO for us. The device is capable than more
>>>>> IOPS than a single core can deliver, so we're CPU limited in this test.
>>>>> Hence it's a good test case as it does actual work, and shows software
>>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference
>>>>> between runs on the same base), yet I did average 4 runs.
>>>>>
>>>>> Kernel              SQPOLL          IOPS            Perf diff
>>>>> ---------------------------------------------------------
>>>>> 5.13                0               3029872         0.0%
>>>>> 5.13                1               3031056         0.0%
>>>>> 5.13 + audit        0               2894160         -4.5%
>>>>> 5.13 + audit        1               2886168         -4.8%
>>>>>
>>>>> That's an immediate drop in perf of almost 5%. Looking at a quick
>>>>> profile of it (nothing fancy, just checking for 'audit' in the profile)
>>>>> shows this:
>>>>>
>>>>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>>>>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>>>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>>>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>>>>>
>>>>> Note that this is with _no_ rules!
>>>>
>>>> io_uring also supports a NOP command, which basically just measures
>>>> reqs/sec through the interface. Ran that as well:
>>>>
>>>> Kernel               SQPOLL          IOPS            Perf diff
>>>> ---------------------------------------------------------
>>>> 5.13         0               31.05M          0.0%
>>>> 5.13 + audit 0               25.31M          -18.5%
>>>>
>>>> and profile for the latter includes:
>>>>
>>>> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>>>> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>>>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>>>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>>>
>>> As Pavel correctly pointed it, looks like auditing is enabled. And
>>> indeed it was! Hence the above numbers is without having turned off
>>> auditing. Running the NOPs after having turned off audit, we get 30.6M
>>> IOPS, which is down about 1.5% from the baseline. The results for the
>>> polled random read test above did _not_ change from this, they are still
>>> down the same amount.
>>>
>>> Note, and I should have included this in the first email, this is not
>>> any kind of argument for or against audit logging. It's purely meant to
>>> be a set of numbers that show how the current series impacts
>>> performance.
>>
>> And finally, just checking if we make it optional per opcode if we see
>> any real impact, and the answer is no. Using the below patch which
>> effectively bypasses audit calls unless the opcode has flagged the need
>> to do so, I cannot measure any difference in perf (as expected).
>>
>> To turn this into something useful, my suggestion as a viable path
>> forward would be:
>>
>> 1) Use something like the below patch and flag request types that we
>>    want to do audit logging for.
>>
>> 2) As Pavel suggested, eliminate the need for having both and entry/exit
>>    hook, turning it into just one. That effectively cuts the number of
>>    checks and calls in half.
> 
> I suspect the updated working-io_uring branch with HEAD at
> 1f25193a3f54 (updated a short time ago, see my last email in this
> thread) will improve performance.  Also, as has been mention several

See the email you replied to: the ~1.5% was basically the overhead of
two `if (io_op_defs[req->opcode].audit)` checks in the nop case, where
at least one load of req->opcode is cached. But to be completely fair,
misses are unlikely.
Paul Moore May 26, 2021, 7:10 p.m. UTC | #24
On Wed, May 26, 2021 at 2:57 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> On 5/26/21 7:44 PM, Paul Moore wrote:
> > On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 5/26/21 11:54 AM, Jens Axboe wrote:
> >>> On 5/26/21 11:31 AM, Jens Axboe wrote:
> >>>> On 5/26/21 11:15 AM, Jens Axboe wrote:
> >>>>> On 5/25/21 8:04 PM, Paul Moore wrote:
> >>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>>>>>>> That said, audit is not for everyone, and we have build time and
> >>>>>>>> runtime options to help make life easier.  Beyond simply disabling
> >>>>>>>> audit at compile time a number of Linux distributions effectively
> >>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit
> >>>>>>>> filter, for example:
> >>>>>>>>
> >>>>>>>>  % auditctl -a task,never
> >>>>>>>
> >>>>>>> As has been brought up, the issue we're facing is that distros have
> >>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >>>>>>> of people doing custom kernels. My question would then be how much
> >>>>>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>>>>> If auditctl is turned off, what is the expectation in turns of overhead?
> >>>>>>
> >>>>>> I commented on that case in my last email to Pavel, but I'll try to go
> >>>>>> over it again in a little more detail.
> >>>>>>
> >>>>>> As we discussed earlier in this thread, we can skip the req->opcode
> >>>>>> check before both the _entry and _exit calls, so we are left with just
> >>>>>> the bare audit calls in the io_uring code.  As the _entry and _exit
> >>>>>> functions are small, I've copied them and their supporting functions
> >>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> >>>>>> "task,never" case.
> >>>>>>
> >>>>>> +  static inline struct audit_context *audit_context(void)
> >>>>>> +  {
> >>>>>> +    return current->audit_context;
> >>>>>> +  }
> >>>>>>
> >>>>>> +  static inline bool audit_dummy_context(void)
> >>>>>> +  {
> >>>>>> +    void *p = audit_context();
> >>>>>> +    return !p || *(int *)p;
> >>>>>> +  }
> >>>>>>
> >>>>>> +  static inline void audit_uring_entry(u8 op)
> >>>>>> +  {
> >>>>>> +    if (unlikely(audit_enabled && audit_context()))
> >>>>>> +      __audit_uring_entry(op);
> >>>>>> +  }
> >>>>>>
> >>>>>> We have one if statement where the conditional checks on two
> >>>>>> individual conditions.  The first (audit_enabled) is simply a check to
> >>>>>> see if anyone has "turned on" auditing at runtime; historically this
> >>>>>> worked rather well, and still does in a number of places, but ever
> >>>>>> since systemd has taken to forcing audit on regardless of the admin's
> >>>>>> audit configuration it is less useful.  The second (audit_context())
> >>>>>> is a check to see if an audit_context has been allocated for the
> >>>>>> current task.  In the case of "task,never" current->audit_context will
> >>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
> >>>>>> will never be called.
> >>>>>>
> >>>>>> Worst case here is checking the value of audit_enabled and
> >>>>>> current->audit_context.  Depending on which you think is more likely
> >>>>>> we can change the order of the check so that the
> >>>>>> current->audit_context check is first if you feel that is more likely
> >>>>>> to be NULL than audit_enabled is to be false (it may be that way now).
> >>>>>>
> >>>>>> +  static inline void audit_uring_exit(int success, long code)
> >>>>>> +  {
> >>>>>> +    if (unlikely(!audit_dummy_context()))
> >>>>>> +      __audit_uring_exit(success, code);
> >>>>>> +  }
> >>>>>>
> >>>>>> The exit call is very similar to the entry call, but in the
> >>>>>> "task,never" case it is very simple as the first check to be performed
> >>>>>> is the current->audit_context check which we know to be NULL.  The
> >>>>>> __audit_uring_exit() slowpath will never be called.
> >>>>>
> >>>>> I actually ran some numbers this morning. The test base is 5.13+, and
> >>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> >>>>> test and the test with this series applied. I used your git branch as of
> >>>>> this morning.
> >>>>>
> >>>>> The test case is my usual peak perf test, which is random reads at
> >>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> >>>>> two different tests - one was having a thread just do the IO, the other
> >>>>> is using SQPOLL to do the IO for us. The device is capable than more
> >>>>> IOPS than a single core can deliver, so we're CPU limited in this test.
> >>>>> Hence it's a good test case as it does actual work, and shows software
> >>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference
> >>>>> between runs on the same base), yet I did average 4 runs.
> >>>>>
> >>>>> Kernel              SQPOLL          IOPS            Perf diff
> >>>>> ---------------------------------------------------------
> >>>>> 5.13                0               3029872         0.0%
> >>>>> 5.13                1               3031056         0.0%
> >>>>> 5.13 + audit        0               2894160         -4.5%
> >>>>> 5.13 + audit        1               2886168         -4.8%
> >>>>>
> >>>>> That's an immediate drop in perf of almost 5%. Looking at a quick
> >>>>> profile of it (nothing fancy, just checking for 'audit' in the profile)
> >>>>> shows this:
> >>>>>
> >>>>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> >>>>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >>>>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >>>>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> >>>>>
> >>>>> Note that this is with _no_ rules!
> >>>>
> >>>> io_uring also supports a NOP command, which basically just measures
> >>>> reqs/sec through the interface. Ran that as well:
> >>>>
> >>>> Kernel               SQPOLL          IOPS            Perf diff
> >>>> ---------------------------------------------------------
> >>>> 5.13                 0               31.05M          0.0%
> >>>> 5.13 + audit         0               25.31M          -18.5%
> >>>>
> >>>> and profile for the latter includes:
> >>>>
> >>>> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> >>>> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >>>>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >>>>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> >>>
> >>> As Pavel correctly pointed out, it looks like auditing is enabled. And
> >>> indeed it was! Hence the above numbers are without having turned off
> >>> auditing. Running the NOPs after having turned off audit, we get 30.6M
> >>> IOPS, which is down about 1.5% from the baseline. The results for the
> >>> polled random read test above did _not_ change from this, they are still
> >>> down the same amount.
> >>>
> >>> Note, and I should have included this in the first email, this is not
> >>> any kind of argument for or against audit logging. It's purely meant to
> >>> be a set of numbers that show how the current series impacts
> >>> performance.
> >>
> >> And finally, just checking if we make it optional per opcode if we see
> >> any real impact, and the answer is no. Using the below patch which
> >> effectively bypasses audit calls unless the opcode has flagged the need
> >> to do so, I cannot measure any difference in perf (as expected).
> >>
> >> To turn this into something useful, my suggestion as a viable path
> >> forward would be:
> >>
> >> 1) Use something like the below patch and flag request types that we
> >>    want to do audit logging for.
> >>
> >> 2) As Pavel suggested, eliminate the need for having both an entry/exit
> >>    hook, turning it into just one. That effectively cuts the number of
> >>    checks and calls in half.
> >
> > I suspect the updated working-io_uring branch with HEAD at
> > 1f25193a3f54 (updated a short time ago, see my last email in this
> > thread) will improve performance.  Also, as has been mentioned several
>
> See the email you replied to, ~1.5% was basically an overhead of
> two `if (io_op_defs[req->opcode].audit)` in case of nops, where at
> least once req->opcode is cached. But to be completely fair, it misses
> unlikely().

Maybe.  I remain skeptical that "io_op_defs[req->opcode].audit" has
the same cost as "unlikely(audit_context())".
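
For reference, a rough sketch of the per-opcode gate from Jens's test
patch as quoted above; that patch is not included in this thread, so the
surrounding code is an assumption:

  if (io_op_defs[req->opcode].audit)
          audit_uring_entry(req->opcode);

  /* ... the big opcode switch in io_issue_sqe() runs here ... */

  if (io_op_defs[req->opcode].audit)
          audit_uring_exit(!ret, ret);

versus the unconditional inline check from this series, where the cost is
one unlikely(audit_enabled && audit_context()) test per call.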
Jens Axboe May 26, 2021, 7:44 p.m. UTC | #25
On 5/26/21 12:44 PM, Paul Moore wrote:
> On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 5/26/21 11:54 AM, Jens Axboe wrote:
>>> On 5/26/21 11:31 AM, Jens Axboe wrote:
>>>> On 5/26/21 11:15 AM, Jens Axboe wrote:
>>>>> On 5/25/21 8:04 PM, Paul Moore wrote:
>>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>>>>>> That said, audit is not for everyone, and we have build time and
>>>>>>>> runtime options to help make life easier.  Beyond simply disabling
>>>>>>>> audit at compile time a number of Linux distributions effectively
>>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>>>>>> filter, for example:
>>>>>>>>
>>>>>>>>  % auditctl -a task,never
>>>>>>>
>>>>>>> As has been brought up, the issue we're facing is that distros have
>>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>>>>>> of people doing custom kernels. My question would then be how much
>>>>>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>>>>> If auditctl is turned off, what is the expectation in terms of overhead?
>>>>>>
>>>>>> I commented on that case in my last email to Pavel, but I'll try to go
>>>>>> over it again in a little more detail.
>>>>>>
>>>>>> As we discussed earlier in this thread, we can skip the req->opcode
>>>>>> check before both the _entry and _exit calls, so we are left with just
>>>>>> the bare audit calls in the io_uring code.  As the _entry and _exit
>>>>>> functions are small, I've copied them and their supporting functions
>>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>>>>>> "task,never" case.
>>>>>>
>>>>>> +  static inline struct audit_context *audit_context(void)
>>>>>> +  {
>>>>>> +    return current->audit_context;
>>>>>> +  }
>>>>>>
>>>>>> +  static inline bool audit_dummy_context(void)
>>>>>> +  {
>>>>>> +    void *p = audit_context();
>>>>>> +    return !p || *(int *)p;
>>>>>> +  }
>>>>>>
>>>>>> +  static inline void audit_uring_entry(u8 op)
>>>>>> +  {
>>>>>> +    if (unlikely(audit_enabled && audit_context()))
>>>>>> +      __audit_uring_entry(op);
>>>>>> +  }
>>>>>>
>>>>>> We have one if statement where the conditional checks on two
>>>>>> individual conditions.  The first (audit_enabled) is simply a check to
>>>>>> see if anyone has "turned on" auditing at runtime; historically this
>>>>>> worked rather well, and still does in a number of places, but ever
>>>>>> since systemd has taken to forcing audit on regardless of the admin's
>>>>>> audit configuration it is less useful.  The second (audit_context())
>>>>>> is a check to see if an audit_context has been allocated for the
>>>>>> current task.  In the case of "task,never" current->audit_context will
>>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
>>>>>> will never be called.
>>>>>>
>>>>>> Worst case here is checking the value of audit_enabled and
>>>>>> current->audit_context.  Depending on which you think is more likely
>>>>>> we can change the order of the check so that the
>>>>>> current->audit_context check is first if you feel that is more likely
>>>>>> to be NULL than audit_enabled is to be false (it may be that way now).
>>>>>>
>>>>>> +  static inline void audit_uring_exit(int success, long code)
>>>>>> +  {
>>>>>> +    if (unlikely(!audit_dummy_context()))
>>>>>> +      __audit_uring_exit(success, code);
>>>>>> +  }
>>>>>>
>>>>>> The exit call is very similar to the entry call, but in the
>>>>>> "task,never" case it is very simple as the first check to be performed
>>>>>> is the current->audit_context check which we know to be NULL.  The
>>>>>> __audit_uring_exit() slowpath will never be called.
>>>>>
>>>>> I actually ran some numbers this morning. The test base is 5.13+, and
>>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
>>>>> test and the test with this series applied. I used your git branch as of
>>>>> this morning.
>>>>>
>>>>> The test case is my usual peak perf test, which is random reads at
>>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
>>>>> two different tests - one was having a thread just do the IO, the other
>>>>> is using SQPOLL to do the IO for us. The device is capable of more
>>>>> IOPS than a single core can deliver, so we're CPU limited in this test.
>>>>> Hence it's a good test case as it does actual work, and shows software
>>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference
>>>>> between runs on the same base), yet I did average 4 runs.
>>>>>
>>>>> Kernel              SQPOLL          IOPS            Perf diff
>>>>> ---------------------------------------------------------
>>>>> 5.13                0               3029872         0.0%
>>>>> 5.13                1               3031056         0.0%
>>>>> 5.13 + audit        0               2894160         -4.5%
>>>>> 5.13 + audit        1               2886168         -4.8%
>>>>>
>>>>> That's an immediate drop in perf of almost 5%. Looking at a quick
>>>>> profile of it (nothing fancy, just checking for 'audit' in the profile)
>>>>> shows this:
>>>>>
>>>>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>>>>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>>>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>>>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>>>>>
>>>>> Note that this is with _no_ rules!
>>>>
>>>> io_uring also supports a NOP command, which basically just measures
>>>> reqs/sec through the interface. Ran that as well:
>>>>
>>>> Kernel               SQPOLL          IOPS            Perf diff
>>>> ---------------------------------------------------------
>>>> 5.13                 0               31.05M          0.0%
>>>> 5.13 + audit         0               25.31M          -18.5%
>>>>
>>>> and profile for the latter includes:
>>>>
>>>> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>>>> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>>>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>>>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>>>
>>> As Pavel correctly pointed out, it looks like auditing is enabled. And
>>> indeed it was! Hence the above numbers are without having turned off
>>> auditing. Running the NOPs after having turned off audit, we get 30.6M
>>> IOPS, which is down about 1.5% from the baseline. The results for the
>>> polled random read test above did _not_ change from this, they are still
>>> down the same amount.
>>>
>>> Note, and I should have included this in the first email, this is not
>>> any kind of argument for or against audit logging. It's purely meant to
>>> be a set of numbers that show how the current series impacts
>>> performance.
>>
>> And finally, just checking if we make it optional per opcode if we see
>> any real impact, and the answer is no. Using the below patch which
>> effectively bypasses audit calls unless the opcode has flagged the need
>> to do so, I cannot measure any difference in perf (as expected).
>>
>> To turn this into something useful, my suggestion as a viable path
>> forward would be:
>>
>> 1) Use something like the below patch and flag request types that we
>>    want to do audit logging for.
>>
>> 2) As Pavel suggested, eliminate the need for having both an entry/exit
>>    hook, turning it into just one. That effectively cuts the number of
>>    checks and calls in half.
> 
> I suspect the updated working-io_uring branch with HEAD at
> 1f25193a3f54 (updated a short time ago, see my last email in this
> > thread) will improve performance.  Also, as has been mentioned several
> times now, for audit to work we need both the _entry and _exit call.

Pulled in your new stuff, and ran a quick test again (same as before,
the rand reads). No change, we're still down ~5% with auditctl -a
task,never having been run. If I stub out your two audit entry/exit
calls by adding an if (0) before them, then perf is back to normal.
Paul Moore May 26, 2021, 8:19 p.m. UTC | #26
On Wed, May 26, 2021 at 3:44 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 5/26/21 12:44 PM, Paul Moore wrote:
> > On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 5/26/21 11:54 AM, Jens Axboe wrote:
> >>> On 5/26/21 11:31 AM, Jens Axboe wrote:
> >>>> On 5/26/21 11:15 AM, Jens Axboe wrote:
> >>>>> On 5/25/21 8:04 PM, Paul Moore wrote:
> >>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>>>>>>> That said, audit is not for everyone, and we have build time and
> >>>>>>>> runtime options to help make life easier.  Beyond simply disabling
> >>>>>>>> audit at compile time a number of Linux distributions effectively
> >>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit
> >>>>>>>> filter, for example:
> >>>>>>>>
> >>>>>>>>  % auditctl -a task,never
> >>>>>>>
> >>>>>>> As has been brought up, the issue we're facing is that distros have
> >>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >>>>>>> of people doing custom kernels. My question would then be how much
> >>>>>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>>>>> If auditctl is turned off, what is the expectation in terms of overhead?
> >>>>>>
> >>>>>> I commented on that case in my last email to Pavel, but I'll try to go
> >>>>>> over it again in a little more detail.
> >>>>>>
> >>>>>> As we discussed earlier in this thread, we can skip the req->opcode
> >>>>>> check before both the _entry and _exit calls, so we are left with just
> >>>>>> the bare audit calls in the io_uring code.  As the _entry and _exit
> >>>>>> functions are small, I've copied them and their supporting functions
> >>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> >>>>>> "task,never" case.
> >>>>>>
> >>>>>> +  static inline struct audit_context *audit_context(void)
> >>>>>> +  {
> >>>>>> +    return current->audit_context;
> >>>>>> +  }
> >>>>>>
> >>>>>> +  static inline bool audit_dummy_context(void)
> >>>>>> +  {
> >>>>>> +    void *p = audit_context();
> >>>>>> +    return !p || *(int *)p;
> >>>>>> +  }
> >>>>>>
> >>>>>> +  static inline void audit_uring_entry(u8 op)
> >>>>>> +  {
> >>>>>> +    if (unlikely(audit_enabled && audit_context()))
> >>>>>> +      __audit_uring_entry(op);
> >>>>>> +  }
> >>>>>>
> >>>>>> We have one if statement where the conditional checks on two
> >>>>>> individual conditions.  The first (audit_enabled) is simply a check to
> >>>>>> see if anyone has "turned on" auditing at runtime; historically this
> >>>>>> worked rather well, and still does in a number of places, but ever
> >>>>>> since systemd has taken to forcing audit on regardless of the admin's
> >>>>>> audit configuration it is less useful.  The second (audit_context())
> >>>>>> is a check to see if an audit_context has been allocated for the
> >>>>>> current task.  In the case of "task,never" current->audit_context will
> >>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
> >>>>>> will never be called.
> >>>>>>
> >>>>>> Worst case here is checking the value of audit_enabled and
> >>>>>> current->audit_context.  Depending on which you think is more likely
> >>>>>> we can change the order of the check so that the
> >>>>>> current->audit_context check is first if you feel that is more likely
> >>>>>> to be NULL than audit_enabled is to be false (it may be that way now).
> >>>>>>
> >>>>>> +  static inline void audit_uring_exit(int success, long code)
> >>>>>> +  {
> >>>>>> +    if (unlikely(!audit_dummy_context()))
> >>>>>> +      __audit_uring_exit(success, code);
> >>>>>> +  }
> >>>>>>
> >>>>>> The exit call is very similar to the entry call, but in the
> >>>>>> "task,never" case it is very simple as the first check to be performed
> >>>>>> is the current->audit_context check which we know to be NULL.  The
> >>>>>> __audit_uring_exit() slowpath will never be called.
> >>>>>
> >>>>> I actually ran some numbers this morning. The test base is 5.13+, and
> >>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> >>>>> test and the test with this series applied. I used your git branch as of
> >>>>> this morning.
> >>>>>
> >>>>> The test case is my usual peak perf test, which is random reads at
> >>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> >>>>> two different tests - one was having a thread just do the IO, the other
> >>>>> is using SQPOLL to do the IO for us. The device is capable of more
> >>>>> IOPS than a single core can deliver, so we're CPU limited in this test.
> >>>>> Hence it's a good test case as it does actual work, and shows software
> >>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference
> >>>>> between runs on the same base), yet I did average 4 runs.
> >>>>>
> >>>>> Kernel              SQPOLL          IOPS            Perf diff
> >>>>> ---------------------------------------------------------
> >>>>> 5.13                0               3029872         0.0%
> >>>>> 5.13                1               3031056         0.0%
> >>>>> 5.13 + audit        0               2894160         -4.5%
> >>>>> 5.13 + audit        1               2886168         -4.8%
> >>>>>
> >>>>> That's an immediate drop in perf of almost 5%. Looking at a quick
> >>>>> profile of it (nothing fancy, just checking for 'audit' in the profile)
> >>>>> shows this:
> >>>>>
> >>>>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> >>>>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >>>>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >>>>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> >>>>>
> >>>>> Note that this is with _no_ rules!
> >>>>
> >>>> io_uring also supports a NOP command, which basically just measures
> >>>> reqs/sec through the interface. Ran that as well:
> >>>>
> >>>> Kernel               SQPOLL          IOPS            Perf diff
> >>>> ---------------------------------------------------------
> >>>> 5.13                 0               31.05M          0.0%
> >>>> 5.13 + audit         0               25.31M          -18.5%
> >>>>
> >>>> and profile for the latter includes:
> >>>>
> >>>> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> >>>> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >>>>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >>>>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> >>>
> >>> As Pavel correctly pointed out, it looks like auditing is enabled. And
> >>> indeed it was! Hence the above numbers are without having turned off
> >>> auditing. Running the NOPs after having turned off audit, we get 30.6M
> >>> IOPS, which is down about 1.5% from the baseline. The results for the
> >>> polled random read test above did _not_ change from this, they are still
> >>> down the same amount.
> >>>
> >>> Note, and I should have included this in the first email, this is not
> >>> any kind of argument for or against audit logging. It's purely meant to
> >>> be a set of numbers that show how the current series impacts
> >>> performance.
> >>
> >> And finally, just checking if we make it optional per opcode if we see
> >> any real impact, and the answer is no. Using the below patch which
> >> effectively bypasses audit calls unless the opcode has flagged the need
> >> to do so, I cannot measure any difference in perf (as expected).
> >>
> >> To turn this into something useful, my suggestion as a viable path
> >> forward would be:
> >>
> >> 1) Use something like the below patch and flag request types that we
> >>    want to do audit logging for.
> >>
> >> 2) As Pavel suggested, eliminate the need for having both an entry/exit
> >>    hook, turning it into just one. That effectively cuts the number of
> >>    checks and calls in half.
> >
> > I suspect the updated working-io_uring branch with HEAD at
> > 1f25193a3f54 (updated a short time ago, see my last email in this
> > thread) will improve performance.  Also, as has been mentioned several
> > times now, for audit to work we need both the _entry and _exit call.
>
> Pulled in your new stuff, and ran a quick test again (same as before,
> the rand reads). No change, we're still down ~5% with auditctl -a
> task,never having been run. If I stub out your two audit entry/exit
> calls by adding an if (0) before them, then perf is back to normal.

That's fun.

Considering that we are down to basically a single if-conditional in
the audit code and your other tests with another single conditional
yielded no significant change it looks like we really only have one
option left before we "agree to disagree" and have to mark io_uring
and audit as mutually exclusive in Kconfig.  If we moved the _entry
and _exit calls into the individual operation case blocks (quick
openat example below) so that only certain operations were able to be
audited would that be acceptable assuming the high frequency ops were
untouched?  My initial gut feeling was that this would involve >50% of
the ops, but Steve Grubb seems to think it would be less; it may be
time to look at that a bit more seriously, but if it gets a NACK
regardless it isn't worth the time - thoughts?

  case IORING_OP_OPENAT:
    audit_uring_entry(req->opcode);
    ret = io_openat(req, issue_flags);
    audit_uring_exit(!ret, ret);
    break;
Richard Guy Briggs May 27, 2021, 5:27 p.m. UTC | #27
On 2021-05-26 11:22, Jens Axboe wrote:
> On 5/26/21 9:49 AM, Richard Guy Briggs wrote:
> >> So why is there anything special needed for io_uring (now that the
> >> native worker threads are used)?
> > 
> > Because syscall has been bypassed by a memory-mapped work queue.
> 
> I don't follow this one at all, that's just the delivery mechanism if
> you choose to use SQPOLL. If you do, then a thread sibling of the
> original task does the actual system call. There's no magic involved
> there, and the tasks are related.
> 
> So care to expand on that?

These may be poor examples, but hear me out...

In the case of an open call, I'm guessing that they are mapped 1:1
syscall to io_uring action so that the action can be asynchronous.  So
in this case, there is a record of the action, but we don't see the
result success/failure.  I assume that file paths can only be specified
in the original syscall and not in an SQPOLL action?

In the case of a syscall read/write (which aren't as interesting
generally to audit, but I'm concerned there are other similar situations
that are), the syscall would be called for every buffer that is passed
back and forth between user and kernel, providing a way to log all
that activity.  In the case of SQPOLL, I understand that a syscall sets
up the action and then io_uring takes over and does the bulk transfer
and we'd not have the visibility of how many times that action was
repeated, nor see the result success/failure, due to its asynchrony.

Perhaps I am showing my ignorance, so please correct me if I have it
wrong.

> >> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall
> >> would do? If so, I think that should just be fixed...
> > 
> > This is by design to speed it up.  This is what Paul's iouring entry and
> > exit hooks do.
> 
> As far as I can tell, we're doing double logging at that point, if the
> syscall helper does the audit as well. We'll get something logging the
> io_uring opcode (eg IORING_OP_OPENAT2), then log again when we call the
> fs helper. That's _assuming_ that we always hit the logging part when we
> call into the vfs, but that's something that can be updated to be true
> and kept an eye on for future additions.
> 
> Why is the double logging useful? It only tells you that the invocation
> was via io_uring as the delivery mechanism rather than the usual system
> call, but the effect is the same - the file is opened, for example.
> 
> I feel like I'm missing something here, or the other side is. Or both!

Paul addressed this in his reply, but let me add a more concrete
example...  There was one corner case I was looking at that showed up
this issue.  Please indicate if I have mischaracterized or
misunderstood.

A syscall would generate records something like this:

	AUDIT_SYSCALL
	AUDIT_...
	AUDIT_EOE

A io_uring SQPOLL event would generate something like this:

	AUDIT_URINGOP
	AUDIT_...
	AUDIT_EOE

The "hybrid" event that is a syscall that starts an io_uring action
would generate something like this:

	AUDIT_URINGOP
	[possible AUDIT_CONFIG_CHANGE (from killed_trees)]
	AUDIT_SYSCALL
	AUDIT_...
	AUDIT_EOE

The AUDIT_... is all the operation-specific records that log parameters
that aren't able to be expressed in the SYSCALL or URINGOP records such
as pathnames, other arguments, and context (pwd, etc...).  So this isn't
"double logging".  It is either introducing an io_uring event, or it is
providing more detail about the io_uring arguments to a syscall event.

> Jens Axboe

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore May 28, 2021, 4:02 p.m. UTC | #28
On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote:
> ... If we moved the _entry
> and _exit calls into the individual operation case blocks (quick
> openat example below) so that only certain operations were able to be
> audited would that be acceptable assuming the high frequency ops were
> untouched?  My initial gut feeling was that this would involve >50% of
> the ops, but Steve Grubb seems to think it would be less; it may be
> time to look at that a bit more seriously, but if it gets a NACK
> regardless it isn't worth the time - thoughts?
>
>   case IORING_OP_OPENAT:
>     audit_uring_entry(req->opcode);
>     ret = io_openat(req, issue_flags);
>     audit_uring_exit(!ret, ret);
>     break;

I wanted to pose this question again in case it was lost in the
thread, I suspect this may be the last option before we have to "fix"
things at the Kconfig level.  I definitely don't want to have to go
that route, and I suspect most everyone on this thread feels the same,
so I'm hopeful we can find a solution that is begrudgingly acceptable
to both groups.
Pavel Begunkov June 2, 2021, 8:26 a.m. UTC | #29
On 5/28/21 5:02 PM, Paul Moore wrote:
> On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote:
>> ... If we moved the _entry
>> and _exit calls into the individual operation case blocks (quick
>> openat example below) so that only certain operations were able to be
>> audited would that be acceptable assuming the high frequency ops were
>> untouched?  My initial gut feeling was that this would involve >50% of
>> the ops, but Steve Grubb seems to think it would be less; it may be
>> time to look at that a bit more seriously, but if it gets a NACK
>> regardless it isn't worth the time - thoughts?
>>
>>   case IORING_OP_OPENAT:
>>     audit_uring_entry(req->opcode);
>>     ret = io_openat(req, issue_flags);
>>     audit_uring_exit(!ret, ret);
>>     break;
> 
> I wanted to pose this question again in case it was lost in the
> thread, I suspect this may be the last option before we have to "fix"
> things at the Kconfig level.  I definitely don't want to have to go
> that route, and I suspect most everyone on this thread feels the same,
> so I'm hopeful we can find a solution that is begrudgingly acceptable
> to both groups.

May work for me, but have to ask how many, and what is the
criteria? I'd think anything opening a file or manipulating fs:

IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2],
IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN,
IORING_OP_FILES_UPDATE
+ coming mkdirat and others.

IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV?

What about?
IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE,
IORING_OP_FALLOCATE, IORING_OP_STATX,
IORING_OP_FADVISE, IORING_OP_MADVISE,
IORING_OP_EPOLL_CTL


Another question, io_uring may exercise asynchronous paths,
i.e. io_issue_sqe() returns before the request completes.
Shouldn't be the case for open/etc at the moment, but was that
considered?

I don't see it happening, but would prefer to keep it open for
async reimplementation in the distant future. Does audit sleep?
Richard Guy Briggs June 2, 2021, 3:46 p.m. UTC | #30
On 2021-06-02 09:26, Pavel Begunkov wrote:
> On 5/28/21 5:02 PM, Paul Moore wrote:
> > On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote:
> >> ... If we moved the _entry
> >> and _exit calls into the individual operation case blocks (quick
> >> openat example below) so that only certain operations were able to be
> >> audited would that be acceptable assuming the high frequency ops were
> >> untouched?  My initial gut feeling was that this would involve >50% of
> >> the ops, but Steve Grubb seems to think it would be less; it may be
> >> time to look at that a bit more seriously, but if it gets a NACK
> >> regardless it isn't worth the time - thoughts?
> >>
> >>   case IORING_OP_OPENAT:
> >>     audit_uring_entry(req->opcode);
> >>     ret = io_openat(req, issue_flags);
> >>     audit_uring_exit(!ret, ret);
> >>     break;
> > 
> > I wanted to pose this question again in case it was lost in the
> > thread, I suspect this may be the last option before we have to "fix"
> > things at the Kconfig level.  I definitely don't want to have to go
> > that route, and I suspect most everyone on this thread feels the same,
> > so I'm hopeful we can find a solution that is begrudgingly acceptable
> > to both groups.
> 
> May work for me, but have to ask how many, and what is the
> criteria? I'd think anything opening a file or manipulating fs:
> 
> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2],
> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN,
> IORING_OP_FILES_UPDATE
> + coming mkdirat and others.
> 
> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV?
> 
> What about?
> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE,
> IORING_OP_FALLOCATE, IORING_OP_STATX,
> IORING_OP_FADVISE, IORING_OP_MADVISE,
> IORING_OP_EPOLL_CTL
> 
> 
> Another question, io_uring may exercise asynchronous paths,
> i.e. io_issue_sqe() returns before the request completes.
> Shouldn't be the case for open/etc at the moment, but was that
> considered?

This would be why audit needs to monitor a thread until it wraps up, to
wait for the result code.  My understanding is that both sync and async
parts of an op would be monitored.

> I don't see it happening, but would prefer to keep it open for
> async reimplementation in the distant future. Does audit sleep?

Some parts do, some parts don't depending on what they are interacting
with in the kernel.  It can be made to not sleep if needed.

> Pavel Begunkov

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Richard Guy Briggs June 2, 2021, 5:29 p.m. UTC | #31
On 2021-05-21 17:49, Paul Moore wrote:
> WARNING - This is a work in progress and should not be merged
> anywhere important.  It is almost surely not complete, and while it
> probably compiles it likely hasn't been booted and will do terrible
> things.  You have been warned.
> 
> This patch adds basic auditing to io_uring operations, regardless of
> their context.  This is accomplished by allocating audit_context
> structures for the io-wq worker and io_uring SQPOLL kernel threads
> as well as explicitly auditing the io_uring operations in
> io_issue_sqe().  The io_uring operations are audited using a new
> AUDIT_URINGOP record, an example is shown below:
> 
>   % <TODO - insert AUDIT_URINGOP record example>
> 
> Thanks to Richard Guy Briggs for review and feedback.
> 
> Signed-off-by: Paul Moore <paul@paul-moore.com>
> ---
>  fs/io-wq.c                 |    4 +
>  fs/io_uring.c              |   11 +++
>  include/linux/audit.h      |   17 ++++
>  include/uapi/linux/audit.h |    1 
>  kernel/audit.h             |    2 +
>  kernel/auditsc.c           |  173 ++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 208 insertions(+)
> 
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index 5361a9b4b47b..8af09a3336e0 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -16,6 +16,7 @@
>  #include <linux/rculist_nulls.h>
>  #include <linux/cpu.h>
>  #include <linux/tracehook.h>
> +#include <linux/audit.h>
>  
>  #include "io-wq.h"
>  
> @@ -535,6 +536,8 @@ static int io_wqe_worker(void *data)
>  	snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid);
>  	set_task_comm(current, buf);
>  
> +	audit_alloc_kernel(current);
> +
>  	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
>  		long ret;
>  
> @@ -573,6 +576,7 @@ static int io_wqe_worker(void *data)
>  			raw_spin_unlock_irq(&wqe->lock);
>  	}
>  
> +	audit_free(current);
>  	io_worker_exit(worker);
>  	return 0;
>  }
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index e481ac8a757a..e9941d1ad8fd 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -78,6 +78,7 @@
>  #include <linux/task_work.h>
>  #include <linux/pagemap.h>
>  #include <linux/io_uring.h>
> +#include <linux/audit.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/io_uring.h>
> @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
>  	if (req->work.creds && req->work.creds != current_cred())
>  		creds = override_creds(req->work.creds);
>  
> +	if (req->opcode < IORING_OP_LAST)
> +		audit_uring_entry(req->opcode);
> +
>  	switch (req->opcode) {
>  	case IORING_OP_NOP:
>  		ret = io_nop(req, issue_flags);
> @@ -6211,6 +6215,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
>  		break;
>  	}
>  
> +	if (req->opcode < IORING_OP_LAST)
> +		audit_uring_exit(!ret, ret);
> +
>  	if (creds)
>  		revert_creds(creds);
>  
> @@ -6827,6 +6834,8 @@ static int io_sq_thread(void *data)
>  		set_cpus_allowed_ptr(current, cpu_online_mask);
>  	current->flags |= PF_NO_SETAFFINITY;
>  
> +	audit_alloc_kernel(current);
> +
>  	mutex_lock(&sqd->lock);
>  	/* a user may had exited before the thread started */
>  	io_run_task_work_head(&sqd->park_task_work);
> @@ -6916,6 +6925,8 @@ static int io_sq_thread(void *data)
>  	io_run_task_work_head(&sqd->park_task_work);
>  	mutex_unlock(&sqd->lock);
>  
> +	audit_free(current);
> +
>  	complete(&sqd->exited);
>  	do_exit(0);
>  }
> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index 82b7c1116a85..6a0c013bc7de 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -286,7 +286,10 @@ static inline int audit_signal_info(int sig, struct task_struct *t)
>  /* These are defined in auditsc.c */
>  				/* Public API */
>  extern int  audit_alloc(struct task_struct *task);
> +extern int  audit_alloc_kernel(struct task_struct *task);
>  extern void __audit_free(struct task_struct *task);
> +extern void __audit_uring_entry(u8 op);
> +extern void __audit_uring_exit(int success, long code);
>  extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1,
>  				  unsigned long a2, unsigned long a3);
>  extern void __audit_syscall_exit(int ret_success, long ret_value);
> @@ -323,6 +326,16 @@ static inline void audit_free(struct task_struct *task)
>  	if (unlikely(task->audit_context))
>  		__audit_free(task);
>  }
> +static inline void audit_uring_entry(u8 op)
> +{
> +	if (unlikely(audit_context()))
> +		__audit_uring_entry(op);
> +}
> +static inline void audit_uring_exit(int success, long code)
> +{
> +	if (unlikely(audit_context()))
> +		__audit_uring_exit(success, code);
> +}
>  static inline void audit_syscall_entry(int major, unsigned long a0,
>  				       unsigned long a1, unsigned long a2,
>  				       unsigned long a3)
> @@ -554,6 +567,10 @@ static inline int audit_alloc(struct task_struct *task)
>  {
>  	return 0;
>  }
> +static inline int audit_alloc_kernel(struct task_struct *task)
> +{
> +	return 0;
> +}
>  static inline void audit_free(struct task_struct *task)
>  { }
>  static inline void audit_syscall_entry(int major, unsigned long a0,
> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index cd2d8279a5e4..b26e0c435e8b 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -118,6 +118,7 @@
>  #define AUDIT_TIME_ADJNTPVAL	1333	/* NTP value adjustment */
>  #define AUDIT_BPF		1334	/* BPF subsystem */
>  #define AUDIT_EVENT_LISTENER	1335	/* Task joined multicast read socket */
> +#define AUDIT_URINGOP		1336	/* io_uring operation */
>  
>  #define AUDIT_AVC		1400	/* SE Linux avc denial or grant */
>  #define AUDIT_SELINUX_ERR	1401	/* Internal SE Linux Errors */
> diff --git a/kernel/audit.h b/kernel/audit.h
> index fba180de5912..50de827497ca 100644
> --- a/kernel/audit.h
> +++ b/kernel/audit.h
> @@ -100,10 +100,12 @@ struct audit_context {
>  	enum {
>  		AUDIT_CTX_UNUSED,	/* audit_context is currently unused */
>  		AUDIT_CTX_SYSCALL,	/* in use by syscall */
> +		AUDIT_CTX_URING,	/* in use by io_uring */
>  	} context;
>  	enum audit_state    state, current_state;
>  	unsigned int	    serial;     /* serial number for record */
>  	int		    major;      /* syscall number */
> +	int		    uring_op;   /* uring operation */
>  	struct timespec64   ctime;      /* time of syscall entry */
>  	unsigned long	    argv[4];    /* syscall arguments */
>  	long		    return_code;/* syscall return code */
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index cc89e9f9a753..729849d41631 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -953,6 +953,7 @@ static void audit_reset_context(struct audit_context *ctx)
>  	ctx->current_state = ctx->state;
>  	ctx->serial = 0;
>  	ctx->major = 0;
> +	ctx->uring_op = 0;
>  	ctx->ctime = (struct timespec64){ .tv_sec = 0, .tv_nsec = 0 };
>  	memset(ctx->argv, 0, sizeof(ctx->argv));
>  	ctx->return_code = 0;
> @@ -1038,6 +1039,31 @@ int audit_alloc(struct task_struct *tsk)
>  	return 0;
>  }
>  
> +/**
> + * audit_alloc_kernel - allocate an audit_context for a kernel task
> + * @tsk: the kernel task
> + *
> + * Similar to the audit_alloc() function, but intended for kernel private
> + * threads.  Returns zero on success, negative values on failure.
> + */
> +int audit_alloc_kernel(struct task_struct *tsk)
> +{
> +	/*
> +	 * At the moment we are just going to call into audit_alloc() to
> +	 * simplify the code, but there two things to keep in mind with this
> +	 * approach:
> +	 *
> +	 * 1. Filtering internal kernel tasks is a bit laughable in almost all
> +	 * cases, but there is at least one case where there is a benefit:
> +	 * the '-a task,never' case allows the admin to effectively disable
> +	 * task auditing at runtime.
> +	 *
> +	 * 2. The {set,clear}_task_syscall_work() ops likely have zero effect
> +	 * on these internal kernel tasks, but they probably don't hurt either.
> +	 */
> +	return audit_alloc(tsk);
> +}
> +
>  static inline void audit_free_context(struct audit_context *context)
>  {
>  	/* resetting is extra work, but it is likely just noise */
> @@ -1536,6 +1562,52 @@ static void audit_log_proctitle(void)
>  	audit_log_end(ab);
>  }
>  
> +/**
> > + * audit_log_uring - generate an AUDIT_URINGOP record
> + * @ctx: the audit context
> + */
> +static void audit_log_uring(struct audit_context *ctx)
> +{
> +	struct audit_buffer *ab;
> +	const struct cred *cred;
> +
> +	/*
> +	 * TODO: What do we log here?  I'm tossing in a few things to start the
> +	 *       conversation, but additional thought needs to go into this.
> +	 */
> +
> +	ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP);
> +	if (!ab)
> +		return;
> +	cred = current_cred();

This may need to be req->work.creds.  I haven't been following if the
io_uring thread inherited the user task's creds (and below, comm and
exe).

> +	audit_log_format(ab, "uring_op=%d", ctx->uring_op);

arch is stored below in __audit_uring_entry() and never used in the
AUDIT_CTX_URING case.  That assignment can either be dropped or printed
before uring_op similar to the SYSCALL record.  There aren't really any
arg[0-3] to print.

io_uring_register and io_uring_setup() args are better covered by other
records.  io_uring_enter() has 6 args and the last two aren't covered by
SYSCALL anyways.

> +	if (ctx->return_valid != AUDITSC_INVALID)
> +		audit_log_format(ab, " success=%s exit=%ld",
> +				 (ctx->return_valid == AUDITSC_SUCCESS ?
> +				  "yes" : "no"),
> +				 ctx->return_code);
> +	audit_log_format(ab,
> +			 " items=%d"
> +			 " ppid=%d pid=%d auid=%u uid=%u gid=%u"
> +			 " euid=%u suid=%u fsuid=%u"
> +			 " egid=%u sgid=%u fsgid=%u",
> +			 ctx->name_count,
> +			 task_ppid_nr(current),
> +			 task_tgid_nr(current),
> +			 from_kuid(&init_user_ns, audit_get_loginuid(current)),
> +			 from_kuid(&init_user_ns, cred->uid),
> +			 from_kgid(&init_user_ns, cred->gid),
> +			 from_kuid(&init_user_ns, cred->euid),
> +			 from_kuid(&init_user_ns, cred->suid),
> +			 from_kuid(&init_user_ns, cred->fsuid),
> +			 from_kgid(&init_user_ns, cred->egid),
> +			 from_kgid(&init_user_ns, cred->sgid),
> +			 from_kgid(&init_user_ns, cred->fsgid));

The audit session ID is still important, relevant and qualifies auid.
In keeping with the SYSCALL record format, I think we want to keep
ses=audit_get_sessionid(current) in here.

I'm pretty sure we also want to keep comm= and exe= too, but may have to
reach into req->task to get it.  There are two values for comm possible,
one from the original task and second "iou-sqp-<pid>" set at the top of
io_sq_thread().

I'm reluctant to leave them out now and then have to re-add them in yet
another field order later.
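
A minimal sketch of what keeping ses= could look like, reusing the
format-string style of the patch above (its exact position among the
other fields is an assumption):

	audit_log_format(ab, " ses=%u",
			 audit_get_sessionid(current));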

> +	audit_log_task_context(ab);
> +	audit_log_key(ab, ctx->filterkey);
> +	audit_log_end(ab);
> +}
> +
>  static void audit_log_exit(void)
>  {
>  	int i, call_panic = 0;
> @@ -1571,6 +1643,9 @@ static void audit_log_exit(void)
>  		audit_log_key(ab, context->filterkey);
>  		audit_log_end(ab);
>  		break;
> +	case AUDIT_CTX_URING:
> +		audit_log_uring(context);
> +		break;
>  	default:
>  		BUG();
>  		break;
> @@ -1740,6 +1815,104 @@ static void audit_return_fixup(struct audit_context *ctx,
>  	ctx->return_valid = (success ? AUDITSC_SUCCESS : AUDITSC_FAILURE);
>  }
>  
> +/**
> + * __audit_uring_entry - prepare the kernel task's audit context for io_uring
> + * @op: the io_uring opcode
> + *
> + * This is similar to audit_syscall_entry() but is intended for use by io_uring
> + * operations.
> + */
> +void __audit_uring_entry(u8 op)
> +{
> +	struct audit_context *ctx = audit_context();
> +
> +	if (!audit_enabled || !ctx || ctx->state == AUDIT_DISABLED)
> +		return;
> +
> +	/*
> +	 * NOTE: It's possible that we can be called from the process' context
> +	 *       before it returns to userspace, and before audit_syscall_exit()
> +	 *       is called.  In this case there is not much to do, just record
> +	 *       the io_uring details and return.
> +	 */
> +	ctx->uring_op = op;
> +	if (ctx->context == AUDIT_CTX_SYSCALL)
> +		return;
> +
> +	ctx->dummy = !audit_n_rules;
> +	if (!ctx->dummy && ctx->state == AUDIT_BUILD_CONTEXT)
> +		ctx->prio = 0;
> +
> +	ctx->arch = syscall_get_arch(current);
> +	ctx->context = AUDIT_CTX_URING;
> +	ctx->current_state = ctx->state;
> +	ktime_get_coarse_real_ts64(&ctx->ctime);
> +}
> +
> +/**
> + * __audit_uring_exit - wrap up the kernel task's audit context after io_uring
> + * @success: true/false value to indicate if the operation succeeded or not
> + * @code: operation return code
> + *
> + * This is similar to audit_syscall_exit() but is intended for use by io_uring
> + * operations.
> + */
> +void __audit_uring_exit(int success, long code)
> +{
> +	struct audit_context *ctx = audit_context();
> +
> +	/*
> +	 * TODO: At some point we will likely want to filter on io_uring ops
> +	 *       and other things similar to what we do for syscalls, but that
> +	 *       is something for another day; just record what we can here.
> +	 */
> +
> +	if (!ctx || ctx->dummy)
> +		goto out;
> +	if (ctx->context == AUDIT_CTX_SYSCALL) {
> +		/*
> +		 * NOTE: See the note in __audit_uring_entry() about the case
> +		 *       where we may be called from process context before we
> +		 *       return to userspace via audit_syscall_exit().  In this
> +		 *       case we simply emit a URINGOP record and bail, the
> +		 *       normal syscall exit handling will take care of
> +		 *       everything else.
> +		 *       It is also worth mentioning that when we are called,
> +		 *       the current process creds may differ from the creds
> +		 *       used during the normal syscall processing; keep that
> +		 *       in mind if/when we move the record generation code.
> +		 */
> +
> +		/*
> +		 * We need to filter on the syscall info here to decide if we
> +		 * should emit a URINGOP record.  I know it seems odd but this
> +		 * solves the problem where users have a filter to block *all*
> +		 * syscall records in the "exit" filter; we want to preserve
> +		 * the behavior here.
> +		 */
> +		audit_filter_syscall(current, ctx);
> +		audit_filter_inodes(current, ctx);
> +		if (ctx->current_state != AUDIT_RECORD_CONTEXT)
> +			return;
> +
> +		audit_log_uring(ctx);
> +		return;
> +	}
> +
> +	/* this may generate CONFIG_CHANGE records */
> +	if (!list_empty(&ctx->killed_trees))
> +		audit_kill_trees(ctx);
> +
> +	audit_filter_inodes(current, ctx);
> +	if (ctx->current_state != AUDIT_RECORD_CONTEXT)
> +		goto out;
> +	audit_return_fixup(ctx, success, code);
> +	audit_log_exit();
> +
> +out:
> +	audit_reset_context(ctx);
> +}
> +
>  /**
>   * __audit_syscall_entry - fill in an audit record at syscall entry
>   * @major: major syscall type (function)
> 
> --
> Linux-audit mailing list
> Linux-audit@redhat.com
> https://listman.redhat.com/mailman/listinfo/linux-audit

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore June 2, 2021, 7:46 p.m. UTC | #32
On Wed, Jun 2, 2021 at 4:27 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> On 5/28/21 5:02 PM, Paul Moore wrote:
> > On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote:
> >> ... If we moved the _entry
> >> and _exit calls into the individual operation case blocks (quick
> >> openat example below) so that only certain operations were able to be
> >> audited would that be acceptable assuming the high frequency ops were
> >> untouched?  My initial gut feeling was that this would involve >50% of
> >> the ops, but Steve Grubb seems to think it would be less; it may be
> >> time to look at that a bit more seriously, but if it gets a NACK
> >> regardless it isn't worth the time - thoughts?
> >>
> >>   case IORING_OP_OPENAT:
> >>     audit_uring_entry(req->opcode);
> >>     ret = io_openat(req, issue_flags);
> >>     audit_uring_exit(!ret, ret);
> >>     break;
> >
> > I wanted to pose this question again in case it was lost in the
> > thread, I suspect this may be the last option before we have to "fix"
> > things at the Kconfig level.  I definitely don't want to have to go
> > that route, and I suspect most everyone on this thread feels the same,
> > so I'm hopeful we can find a solution that is begrudgingly acceptable
> > to both groups.
>
> May work for me, but have to ask how many, and what is the
> criteria? I'd think anything opening a file or manipulating fs:
>
> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2],
> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN,
> IORING_OP_FILES_UPDATE
> + coming mkdirat and others.
>
> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV?
>
> What about?
> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE,
> IORING_OP_FALLOCATE, IORING_OP_STATX,
> IORING_OP_FADVISE, IORING_OP_MADVISE,
> IORING_OP_EPOLL_CTL

Looking quickly at v5.13-rc4 the following seems like candidates for
auditing, there may be a small number of subtractions/additions to
this list as people take a closer look, but it should serve as a
starting point:

IORING_OP_SENDMSG
IORING_OP_RECVMSG
IORING_OP_ACCEPT
IORING_OP_CONNECT
IORING_OP_FALLOCATE
IORING_OP_OPENAT
IORING_OP_CLOSE
IORING_OP_MADVISE
IORING_OP_OPENAT2
IORING_OP_SHUTDOWN
IORING_OP_RENAMEAT
IORING_OP_UNLINKAT

... can you live with that list?
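
To make that concrete, a minimal sketch of a helper covering exactly the
candidates above; the helper name and the switch-based form are
assumptions, only the opcode list itself comes from this thread:

  static inline bool io_op_needs_audit(u8 opcode)
  {
  	switch (opcode) {
  	case IORING_OP_SENDMSG:
  	case IORING_OP_RECVMSG:
  	case IORING_OP_ACCEPT:
  	case IORING_OP_CONNECT:
  	case IORING_OP_FALLOCATE:
  	case IORING_OP_OPENAT:
  	case IORING_OP_CLOSE:
  	case IORING_OP_MADVISE:
  	case IORING_OP_OPENAT2:
  	case IORING_OP_SHUTDOWN:
  	case IORING_OP_RENAMEAT:
  	case IORING_OP_UNLINKAT:
  		return true;
  	default:
  		return false;
  	}
  }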

> Another question, io_uring may exercise asynchronous paths,
> i.e. io_issue_sqe() returns before the request completes.
> Shouldn't be the case for open/etc at the moment, but was that
> considered?

Yes, I noticed that when testing the code (and it makes sense when you
look at how io_uring handles things).  Depending on the state of the
system when the io_uring request is submitted I've seen both sync and
async io_uring operations with the associated different calling
contexts.  In the case where io_issue_sqe() needs to defer the
operation to a different context you will see an audit record
indicating that the operation failed and then another audit record
when it completes; it's actually pretty interesting to be able to see
how the system and io_uring are working.

We could always mask out these delayed attempts, but at this early
stage they are helpful, and they may be useful for admins.

> I don't see it happening, but would prefer to keep it open for
> async reimplementation in the distant future. Does audit sleep?

The only place in the audit_uring_entry()/audit_uring_exit() code path
that could sleep at present is the call to audit_log_uring() which is
made when the rules dictate that an audit record be generated.  The
offending code is an allocation in audit_log_uring() which is
currently GFP_KERNEL but really should be GFP_ATOMIC, or similar.  It
was a copy-n-paste from the similar syscall function where GFP_KERNEL
is appropriate due to the calling context at the end of the syscall.
I'll change that as soon as I'm done with this email.
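
For clarity, the change being described amounts to a one-line switch in
audit_log_uring(); a sketch of the intent, not necessarily the actual
commit:

  -	ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP);
  +	ab = audit_log_start(ctx, GFP_ATOMIC, AUDIT_URINGOP);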

Of course if you are calling io_uring_enter(2), or something similar,
then audit may sleep as part of the normal syscall processing (as
mentioned above), but that is due to the fact that io_uring_enter(2)
is a syscall and not because of anything in io_issue_sqe().
Paul Moore June 2, 2021, 8:46 p.m. UTC | #33
On Wed, Jun 2, 2021 at 1:29 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2021-05-21 17:49, Paul Moore wrote:
> > WARNING - This is a work in progress and should not be merged
> > anywhere important.  It is almost surely not complete, and while it
> > probably compiles it likely hasn't been booted and will do terrible
> > things.  You have been warned.
> >
> > This patch adds basic auditing to io_uring operations, regardless of
> > their context.  This is accomplished by allocating audit_context
> > structures for the io-wq worker and io_uring SQPOLL kernel threads
> > as well as explicitly auditing the io_uring operations in
> > io_issue_sqe().  The io_uring operations are audited using a new
> > AUDIT_URINGOP record, an example is shown below:
> >
> >   % <TODO - insert AUDIT_URINGOP record example>
> >
> > Thanks to Richard Guy Briggs for review and feedback.
> >
> > Signed-off-by: Paul Moore <paul@paul-moore.com>
> > ---
> >  fs/io-wq.c                 |    4 +
> >  fs/io_uring.c              |   11 +++
> >  include/linux/audit.h      |   17 ++++
> >  include/uapi/linux/audit.h |    1
> >  kernel/audit.h             |    2 +
> >  kernel/auditsc.c           |  173 ++++++++++++++++++++++++++++++++++++++++++++
> >  6 files changed, 208 insertions(+)

...

> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index e481ac8a757a..e9941d1ad8fd 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -78,6 +78,7 @@
> >  #include <linux/task_work.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/io_uring.h>
> > +#include <linux/audit.h>
> >
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/io_uring.h>
> > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
> >       if (req->work.creds && req->work.creds != current_cred())
> >               creds = override_creds(req->work.creds);
> >
> > +     if (req->opcode < IORING_OP_LAST)
> > +             audit_uring_entry(req->opcode);

Note well the override_creds() call right above the audit code that is
being added; it will be important later in this email.

> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > index cc89e9f9a753..729849d41631 100644
> > --- a/kernel/auditsc.c
> > +++ b/kernel/auditsc.c
> > @@ -1536,6 +1562,52 @@ static void audit_log_proctitle(void)
> >       audit_log_end(ab);
> >  }
> >
> > +/**
> > + * audit_log_uring - generate an AUDIT_URINGOP record
> > + * @ctx: the audit context
> > + */
> > +static void audit_log_uring(struct audit_context *ctx)
> > +{
> > +     struct audit_buffer *ab;
> > +     const struct cred *cred;
> > +
> > +     /*
> > +      * TODO: What do we log here?  I'm tossing in a few things to start the
> > +      *       conversation, but additional thought needs to go into this.
> > +      */
> > +
> > +     ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP);
> > +     if (!ab)
> > +             return;
> > +     cred = current_cred();
>
> This may need to be req->work.creds.  I haven't been following if the
> io_uring thread inherited the user task's creds (and below, comm and
> exe).

Nope, we're good.  See the existing code in io_issue_sqe() :)

> > +     audit_log_format(ab, "uring_op=%d", ctx->uring_op);
>
> arch is stored below in __audit_uring_entry() and never used in the
> AUDIT_CTX_URING case.  That assignment can either be dropped or printed
> before uring_op similar to the SYSCALL record.

Good point, I'll drop the code that records the arch from _entry(); it
is really only useful to give the appropriate context if needed for
other things in the audit stream, and that isn't the case like it is
with syscalls.

> There aren't really any arg[0-3] to print.

Which is why I didn't print them.

> io_uring_register and io_uring_setup() args are better covered by other
> records.  io_uring_enter() has 6 args and the last two aren't covered by
> SYSCALL anyways.

 ???

I think you are confusing the io_uring ops with syscalls; they are
very different things from an audit perspective and the io_uring
auditing is not intended to replace syscall records.  The
io_uring_setup() and io_uring_enter() syscalls will be auditing just
as any other syscalls would be using the existing syscall audit code.

> > +     if (ctx->return_valid != AUDITSC_INVALID)
> > +             audit_log_format(ab, " success=%s exit=%ld",
> > +                              (ctx->return_valid == AUDITSC_SUCCESS ?
> > +                               "yes" : "no"),
> > +                              ctx->return_code);
> > +     audit_log_format(ab,
> > +                      " items=%d"
> > +                      " ppid=%d pid=%d auid=%u uid=%u gid=%u"
> > +                      " euid=%u suid=%u fsuid=%u"
> > +                      " egid=%u sgid=%u fsgid=%u",
> > +                      ctx->name_count,
> > +                      task_ppid_nr(current),
> > +                      task_tgid_nr(current),
> > +                      from_kuid(&init_user_ns, audit_get_loginuid(current)),
> > +                      from_kuid(&init_user_ns, cred->uid),
> > +                      from_kgid(&init_user_ns, cred->gid),
> > +                      from_kuid(&init_user_ns, cred->euid),
> > +                      from_kuid(&init_user_ns, cred->suid),
> > +                      from_kuid(&init_user_ns, cred->fsuid),
> > +                      from_kgid(&init_user_ns, cred->egid),
> > +                      from_kgid(&init_user_ns, cred->sgid),
> > +                      from_kgid(&init_user_ns, cred->fsgid));
>
> The audit session ID is still important, relevant and qualifies auid.
> In keeping with the SYSCALL record format, I think we want to keep
> ses=audit_get_sessionid(current) in here.

This might be another case of syscall/io_uring confusion.  An io_uring
op doesn't necessarily have an audit session ID or an audit UID in the
conventional sense; for example think about how SQPOLL works, shared
rings, etc.

> I'm pretty sure we also want to keep comm= and exe= too, but may have to
> reach into req->task to get it.  There are two values for comm possible,
> one from the original task and second "iou-sqp-<pid>" set at the top of
> io_sq_thread().

I think this is more syscall/io_uring confusion.
Pavel Begunkov June 3, 2021, 10:39 a.m. UTC | #34
On 6/2/21 4:46 PM, Richard Guy Briggs wrote:
> On 2021-06-02 09:26, Pavel Begunkov wrote:
>> On 5/28/21 5:02 PM, Paul Moore wrote:
>>> On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote:
>>>> ... If we moved the _entry
>>>> and _exit calls into the individual operation case blocks (quick
>>>> openat example below) so that only certain operations were able to be
>>>> audited would that be acceptable assuming the high frequency ops were
>>>> untouched?  My initial gut feeling was that this would involve >50% of
>>>> the ops, but Steve Grubb seems to think it would be less; it may be
>>>> time to look at that a bit more seriously, but if it gets a NACK
>>>> regardless it isn't worth the time - thoughts?
>>>>
>>>>   case IORING_OP_OPENAT:
>>>>     audit_uring_entry(req->opcode);
>>>>     ret = io_openat(req, issue_flags);
>>>>     audit_uring_exit(!ret, ret);
>>>>     break;
>>>
>>> I wanted to pose this question again in case it was lost in the
>>> thread, I suspect this may be the last option before we have to "fix"
>>> things at the Kconfig level.  I definitely don't want to have to go
>>> that route, and I suspect most everyone on this thread feels the same,
>>> so I'm hopeful we can find a solution that is begrudgingly acceptable
>>> to both groups.
>>
>> May work for me, but have to ask how many, and what is the
>> criteria? I'd think anything opening a file or manipulating fs:
>>
>> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2],
>> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN,
>> IORING_OP_FILES_UPDATE
>> + coming mkdirat and others.
>>
>> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV?
>>
>> What about?
>> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE,
>> IORING_OP_FALLOCATE, IORING_OP_STATX,
>> IORING_OP_FADVISE, IORING_OP_MADVISE,
>> IORING_OP_EPOLL_CTL
>>
>>
>> Another question, io_uring may exercise asynchronous paths,
>> i.e. io_issue_sqe() returns before a request completes.
>> Shouldn't be the case for open/etc at the moment, but was that
>> considered?
> 
> This would be why audit needs to monitor a thread until it wraps up, to
> wait for the result code.  My understanding is that both sync and async
> parts of an op would be monitored.

There may be a misunderstanding

audit_start(req)
ret = io_issue_sqe(req);
audit_end(ret);

io_issue_sqe() may return 0 but leave the request inflight,
which will be completed asynchronously e.g. by IRQ, not going
through io_issue_sqe() or any io_read()/etc helpers again, and
after last audit_end() had already happened.
That's the case with read/write/timeout, but is not true for
open/etc.

>> I don't see it happening, but would prefer to keep it open for an
>> async reimplementation in the distant future. Does audit sleep?
> 
> Some parts do, some parts don't depending on what they are interacting
> with in the kernel.  It can be made to not sleep if needed.

Ok, good
Pavel Begunkov June 3, 2021, 10:51 a.m. UTC | #35
On 6/2/21 8:46 PM, Paul Moore wrote:
> On Wed, Jun 2, 2021 at 4:27 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>> On 5/28/21 5:02 PM, Paul Moore wrote:
>>> On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote:
>>>> ... If we moved the _entry
>>>> and _exit calls into the individual operation case blocks (quick
>>>> openat example below) so that only certain operations were able to be
>>>> audited, would that be acceptable, assuming the high-frequency ops were
>>>> untouched?  My initial gut feeling was that this would involve >50% of
>>>> the ops, but Steve Grubb seems to think it would be less; it may be
>>>> time to look at that a bit more seriously, but if it gets a NACK
>>>> regardless it isn't worth the time - thoughts?
>>>>
>>>>   case IORING_OP_OPENAT:
>>>>     audit_uring_entry(req->opcode);
>>>>     ret = io_openat(req, issue_flags);
>>>>     audit_uring_exit(!ret, ret);
>>>>     break;
>>>
>>> I wanted to pose this question again in case it was lost in the
>>> thread, I suspect this may be the last option before we have to "fix"
>>> things at the Kconfig level.  I definitely don't want to have to go
>>> that route, and I suspect most everyone on this thread feels the same,
>>> so I'm hopeful we can find a solution that is begrudgingly acceptable
>>> to both groups.
>>
>> May work for me, but have to ask how many, and what is the
>> criteria? I'd think anything opening a file or manipulating fs:
>>
>> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2],
>> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN,
>> IORING_OP_FILES_UPDATE
>> + coming mkdirat and others.
>>
>> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV?
>>
>> What about?
>> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE,
>> IORING_OP_FALLOCATE, IORING_OP_STATX,
>> IORING_OP_FADVISE, IORING_OP_MADVISE,
>> IORING_OP_EPOLL_CTL
> 
> Looking quickly at v5.13-rc4 the following seems like candidates for
> auditing, there may be a small number of subtractions/additions to
> this list as people take a closer look, but it should serve as a
> starting point:
> 
> IORING_OP_SENDMSG
> IORING_OP_RECVMSG
> IORING_OP_ACCEPT
> IORING_OP_CONNECT
> IORING_OP_FALLOCATE
> IORING_OP_OPENAT
> IORING_OP_CLOSE
> IORING_OP_MADVISE
> IORING_OP_OPENAT2
> IORING_OP_SHUTDOWN
> IORING_OP_RENAMEAT
> IORING_OP_UNLINKAT
> 
> ... can you live with that list?

it will bloat the binary somewhat, but considering it's all in one
place -- io_issue_sqe(), it's workable.

Not nice to have send/recv msg in the list, but I admit they
may do some crazy things. What can be traced for them? Because
at the moment of issue_sqe() not everything is read from the
userspace.

see: io_sendmsg() { ...; io_sendmsg_copy_hdr(); },

will copy header only in io_sendmsg() in most cases, and
then stash it for re-issuing if needed.
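
To make the selective approach concrete, one way to express that
candidate list is a small predicate consulted from io_issue_sqe().
This is only a sketch; the helper name io_op_needs_audit() is invented
here for illustration and is not part of the posted patch:

  static bool io_op_needs_audit(u8 opcode)
  {
          switch (opcode) {
          case IORING_OP_SENDMSG:
          case IORING_OP_RECVMSG:
          case IORING_OP_ACCEPT:
          case IORING_OP_CONNECT:
          case IORING_OP_FALLOCATE:
          case IORING_OP_OPENAT:
          case IORING_OP_CLOSE:
          case IORING_OP_MADVISE:
          case IORING_OP_OPENAT2:
          case IORING_OP_SHUTDOWN:
          case IORING_OP_RENAMEAT:
          case IORING_OP_UNLINKAT:
                  return true;
          default:
                  return false;
          }
  }

  /* io_issue_sqe() would then wrap only the audited opcodes: */
  if (io_op_needs_audit(req->opcode))
          audit_uring_entry(req->opcode);
  ...
  if (io_op_needs_audit(req->opcode))
          audit_uring_exit(!ret, ret);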


>> Another question, io_uring may exercise asynchronous paths,
>> i.e. io_issue_sqe() returns before a request completes.
>> Shouldn't be the case for open/etc at the moment, but was that
>> considered?
> 
> Yes, I noticed that when testing the code (and it makes sense when you
> look at how io_uring handles things).  Depending on the state of the
> system when the io_uring request is submitted I've seen both sync and
> async io_uring operations with the associated different calling
> contexts.  In the case where io_issue_sqe() needs to defer the
> operation to a different context you will see an audit record
> indicating that the operation failed and then another audit record
> when it completes; it's actually pretty interesting to be able to see
> how the system and io_uring are working.

Copying a reply to another message here to clear up the
misunderstanding.

"io_issue_sqe() may return 0 but leave the request inflight,
which will be completed asynchronously e.g. by IRQ, not going
through io_issue_sqe() or any io_read()/etc helpers again, and
after last audit_end() had already happened.
That's the case with read/write/timeout, but is not true for
open/etc."

And there is interest in async send/recv[msg] as well (via
IRQ as described, callbacks, etc.).
 
> We could always mask out these delayed attempts, but at this early
> stage they are helpful, and they may be useful for admins.
> 
>> I don't see it happening, but would prefer to keep it open for an
>> async reimplementation in the distant future. Does audit sleep?
> 
> The only place in the audit_uring_entry()/audit_uring_exit() code path
> that could sleep at present is the call to audit_log_uring() which is
> made when the rules dictate that an audit record be generated.  The
> offending code is an allocation in audit_log_uring() which is
> currently GFP_KERNEL but really should be GFP_ATOMIC, or similar.  It
> was a copy-n-paste from the similar syscall function where GFP_KERNEL
> is appropriate due to the calling context at the end of the syscall.
> I'll change that as soon as I'm done with this email.

Ok, depends where it steers, but there may be a requirement to
not sleep for some hooks because of not having a sleepable context.
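
For reference, the allocation-flag change Paul describes above is a
one-line tweak in audit_log_uring(); sketched here against the patch
below, not taken from a posted revision:

  -	ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP);
  +	ab = audit_log_start(ctx, GFP_ATOMIC, AUDIT_URINGOP);
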

> 
> Of course if you are calling io_uring_enter(2), or something similar,
> then audit may sleep as part of the normal syscall processing (as
> mentioned above), but that is due to the fact that io_uring_enter(2)
> is a syscall and not because of anything in io_issue_sqe().
>
Jens Axboe June 3, 2021, 3:54 p.m. UTC | #36
On 5/28/21 10:02 AM, Paul Moore wrote:
> On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote:
>> ... If we moved the _entry
>> and _exit calls into the individual operation case blocks (quick
>> openat example below) so that only certain operations were able to be
>> audited, would that be acceptable, assuming the high-frequency ops were
>> untouched?  My initial gut feeling was that this would involve >50% of
>> the ops, but Steve Grubb seems to think it would be less; it may be
>> time to look at that a bit more seriously, but if it gets a NACK
>> regardless it isn't worth the time - thoughts?
>>
>>   case IORING_OP_OPENAT:
>>     audit_uring_entry(req->opcode);
>>     ret = io_openat(req, issue_flags);
>>     audit_uring_exit(!ret, ret);
>>     break;
> 
> I wanted to pose this question again in case it was lost in the
> thread, I suspect this may be the last option before we have to "fix"
> things at the Kconfig level.  I definitely don't want to have to go
> that route, and I suspect most everyone on this thread feels the same,
> so I'm hopeful we can find a solution that is begrudgingly acceptable
> to both groups.

Sorry for the lack of response here, but to sum up my order of
preference:

1) It's probably better to just make the audit an opt-out in io_op_defs
   for each opcode, and avoid needing boilerplate code for each op
   handler. The opt-out would ensure that new opcodes get it by default
   if someone doesn't know what it is, and the io_op_defs addition would
   mean that it's in generic code rather than in the handlers. Yes, it's
   a bit slower, but it's saner imho.

2) With the above, I'm fine with adding this to io_uring. I don't think
   going the route of mutual exclusion in kconfig helps anyone; it'd
   be counterproductive to both sides.

Hope that works and helps move this forward. I'll be mostly out of touch
the next week and a half, but wanted to ensure that I sent out my
(brief) thoughts before going away.
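
A minimal sketch of the opt-out Jens describes, assuming a new
audit_skip bit added to io_op_def; the field name is an assumption made
here for illustration, not something posted in this thread:

  struct io_op_def {
          /* ... existing bits such as needs_file, pollin, etc. ... */
          /* don't generate AUDIT_URINGOP records for this opcode */
          unsigned                audit_skip : 1;
  };

  static const struct io_op_def io_op_defs[] = {
          [IORING_OP_NOP] = {
                  .audit_skip             = 1,
          },
          [IORING_OP_READV] = {
                  .needs_file             = 1,
                  .audit_skip             = 1,
                  /* ... */
          },
          [IORING_OP_OPENAT] = {
                  /* no audit_skip: this opcode is audited */
          },
          /* ... */
  };

  /* io_issue_sqe() then needs only one generic pair of calls: */
  if (!io_op_defs[req->opcode].audit_skip)
          audit_uring_entry(req->opcode);
  ...
  if (!io_op_defs[req->opcode].audit_skip)
          audit_uring_exit(!ret, ret);
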
Casey Schaufler June 3, 2021, 3:54 p.m. UTC | #37
On 6/3/2021 3:51 AM, Pavel Begunkov wrote:
> On 6/2/21 8:46 PM, Paul Moore wrote:
>> On Wed, Jun 2, 2021 at 4:27 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>> On 5/28/21 5:02 PM, Paul Moore wrote:
>>>> On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote:
>>>>> ... If we moved the _entry
>>>>> and _exit calls into the individual operation case blocks (quick
>>>>> openat example below) so that only certain operations were able to be
>>>>> audited, would that be acceptable, assuming the high-frequency ops were
>>>>> untouched?  My initial gut feeling was that this would involve >50% of
>>>>> the ops, but Steve Grubb seems to think it would be less; it may be
>>>>> time to look at that a bit more seriously, but if it gets a NACK
>>>>> regardless it isn't worth the time - thoughts?
>>>>>
>>>>>   case IORING_OP_OPENAT:
>>>>>     audit_uring_entry(req->opcode);
>>>>>     ret = io_openat(req, issue_flags);
>>>>>     audit_uring_exit(!ret, ret);
>>>>>     break;
>>>> I wanted to pose this question again in case it was lost in the
>>>> thread, I suspect this may be the last option before we have to "fix"
>>>> things at the Kconfig level.  I definitely don't want to have to go
>>>> that route, and I suspect most everyone on this thread feels the same,
>>>> so I'm hopeful we can find a solution that is begrudgingly acceptable
>>>> to both groups.
>>> May work for me, but have to ask how many, and what is the
>>> criteria? I'd think anything opening a file or manipulating fs:
>>>
>>> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2],
>>> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN,
>>> IORING_OP_FILES_UPDATE
>>> + coming mkdirat and others.
>>>
>>> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV?
>>>
>>> What about?
>>> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE,
>>> IORING_OP_FALLOCATE, IORING_OP_STATX,
>>> IORING_OP_FADVISE, IORING_OP_MADVISE,
>>> IORING_OP_EPOLL_CTL
>> Looking quickly at v5.13-rc4 the following seems like candidates for
>> auditing, there may be a small number of subtractions/additions to
>> this list as people take a closer look, but it should serve as a
>> starting point:
>>
>> IORING_OP_SENDMSG
>> IORING_OP_RECVMSG
>> IORING_OP_ACCEPT
>> IORING_OP_CONNECT
>> IORING_OP_FALLOCATE
>> IORING_OP_OPENAT
>> IORING_OP_CLOSE
>> IORING_OP_MADVISE
>> IORING_OP_OPENAT2
>> IORING_OP_SHUTDOWN
>> IORING_OP_RENAMEAT
>> IORING_OP_UNLINKAT
>>
>> ... can you live with that list?
> it will bloat the binary somewhat, but considering it's all in one
> place -- io_issue_sqe(), it's workable.
>
> Not nice to have send/recv msg in the list, but I admit they
> may do some crazy things. What can be traced for them?

Both SELinux and Smack do access checks on packet operations.
As access may be denied by these checks, audit needs to be available.
This is true for UDS, IP and at least one other protocol family.

>  Because
> at the moment of issue_sqe() not everything is read from the
> userspace.
>
> see: io_sendmsg() { ...; io_sendmsg_copy_hdr(); },
>
> will copy header only in io_sendmsg() in most cases, and
> then stash it for re-issuing if needed.
>
>
>>> Another question, io_uring may exercise asynchronous paths,
>>> i.e. io_issue_sqe() returns before a request completes.
>>> Shouldn't be the case for open/etc at the moment, but was that
>>> considered?
>> Yes, I noticed that when testing the code (and it makes sense when you
>> look at how io_uring handles things).  Depending on the state of the
>> system when the io_uring request is submitted I've seen both sync and
>> async io_uring operations with the associated different calling
>> contexts.  In the case where io_issue_sqe() needs to defer the
>> operation to a different context you will see an audit record
>> indicating that the operation failed and then another audit record
>> when it completes; it's actually pretty interesting to be able to see
>> how the system and io_uring are working.
> Copying a reply to another message here to clear up the
> misunderstanding.
>
> "io_issue_sqe() may return 0 but leave the request inflight,
> which will be completed asynchronously e.g. by IRQ, not going
> through io_issue_sqe() or any io_read()/etc helpers again, and
> after last audit_end() had already happened.
> That's the case with read/write/timeout, but is not true for
> open/etc."
>
> And there is interest in async send/recv[msg] as well (via
> IRQ as described, callbacks, etc.).
>  
>> We could always mask out these delayed attempts, but at this early
>> stage they are helpful, and they may be useful for admins.
>>
>>> I don't see it happening, but would prefer to keep it open for an
>>> async reimplementation in the distant future. Does audit sleep?
>> The only place in the audit_uring_entry()/audit_uring_exit() code path
>> that could sleep at present is the call to audit_log_uring() which is
>> made when the rules dictate that an audit record be generated.  The
>> offending code is an allocation in audit_log_uring() which is
>> currently GFP_KERNEL but really should be GFP_ATOMIC, or similar.  It
>> was a copy-n-paste from the similar syscall function where GFP_KERNEL
>> is appropriate due to the calling context at the end of the syscall.
>> I'll change that as soon as I'm done with this email.
> Ok, depends where it steers, but there may be a requirement to
> not sleep for some hooks because of not having a sleepable context.
>
>> Of course if you are calling io_uring_enter(2), or something similar,
>> then audit may sleep as part of the normal syscall processing (as
>> mentioned above), but that is due to the fact that io_uring_enter(2)
>> is a syscall and not because of anything in io_issue_sqe().
>>
Paul Moore June 4, 2021, 5:04 a.m. UTC | #38
On Thu, Jun 3, 2021 at 11:54 AM Jens Axboe <axboe@kernel.dk> wrote:
> On 5/28/21 10:02 AM, Paul Moore wrote:
> > On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote:
> >> ... If we moved the _entry
> >> and _exit calls into the individual operation case blocks (quick
> >> openat example below) so that only certain operations were able to be
> >> audited, would that be acceptable, assuming the high-frequency ops were
> >> untouched?  My initial gut feeling was that this would involve >50% of
> >> the ops, but Steve Grubb seems to think it would be less; it may be
> >> time to look at that a bit more seriously, but if it gets a NACK
> >> regardless it isn't worth the time - thoughts?
> >>
> >>   case IORING_OP_OPENAT:
> >>     audit_uring_entry(req->opcode);
> >>     ret = io_openat(req, issue_flags);
> >>     audit_uring_exit(!ret, ret);
> >>     break;
> >
> > I wanted to pose this question again in case it was lost in the
> > thread, I suspect this may be the last option before we have to "fix"
> > things at the Kconfig level.  I definitely don't want to have to go
> > that route, and I suspect most everyone on this thread feels the same,
> > so I'm hopeful we can find a solution that is begrudgingly acceptable
> > to both groups.
>
> Sorry for the lack of response here, but to sum up my order of
> preference:
>
> 1) It's probably better to just make the audit an opt-out in io_op_defs
>    for each opcode, and avoid needing boilerplate code for each op
>    handler. The opt-out would ensure that new opcodes get it by default
>    if someone doesn't know what it is, and the io_op_defs addition would
>    mean that it's in generic code rather than in the handlers. Yes, it's
>    a bit slower, but it's saner imho.
>
> 2) With the above, I'm fine with adding this to io_uring. I don't think
>    going the route of mutual exclusion in kconfig helps anyone; it'd
>    be counterproductive to both sides.
>
> Hope that works and helps move this forward. I'll be mostly out of touch
> the next week and a half, but wanted to ensure that I sent out my
> (brief) thoughts before going away.

Thanks, Jens.  I'll revise the patchset based on this (basically doing
an opt-out version of what you did on May 26th) and do a v2 post with
the other accumulated fixes/changes.  If there is anything else that
needs discussion/review, I'm sure Pavel can help us out; he's been
helpful thus far.

--
paul moore
www.paul-moore.com
Richard Guy Briggs Aug. 25, 2021, 1:21 a.m. UTC | #39
On 2021-06-02 13:46, Paul Moore wrote:
> On Wed, Jun 2, 2021 at 1:29 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2021-05-21 17:49, Paul Moore wrote:
> > > WARNING - This is a work in progress and should not be merged
> > > anywhere important.  It is almost surely not complete, and while it
> > > probably compiles it likely hasn't been booted and will do terrible
> > > things.  You have been warned.
> > >
> > > This patch adds basic auditing to io_uring operations, regardless of
> > > their context.  This is accomplished by allocating audit_context
> > > structures for the io-wq worker and io_uring SQPOLL kernel threads
> > > as well as explicitly auditing the io_uring operations in
> > > io_issue_sqe().  The io_uring operations are audited using a new
> > > AUDIT_URINGOP record, an example is shown below:
> > >
> > >   % <TODO - insert AUDIT_URINGOP record example>
> > >
> > > Thanks to Richard Guy Briggs for review and feedback.
> > >
> > > Signed-off-by: Paul Moore <paul@paul-moore.com>
> > > ---
> > >  fs/io-wq.c                 |    4 +
> > >  fs/io_uring.c              |   11 +++
> > >  include/linux/audit.h      |   17 ++++
> > >  include/uapi/linux/audit.h |    1
> > >  kernel/audit.h             |    2 +
> > >  kernel/auditsc.c           |  173 ++++++++++++++++++++++++++++++++++++++++++++
> > >  6 files changed, 208 insertions(+)
> 
> ...
> 
> > > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > > index e481ac8a757a..e9941d1ad8fd 100644
> > > --- a/fs/io_uring.c
> > > +++ b/fs/io_uring.c
> > > @@ -78,6 +78,7 @@
> > >  #include <linux/task_work.h>
> > >  #include <linux/pagemap.h>
> > >  #include <linux/io_uring.h>
> > > +#include <linux/audit.h>
> > >
> > >  #define CREATE_TRACE_POINTS
> > >  #include <trace/events/io_uring.h>
> > > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
> > >       if (req->work.creds && req->work.creds != current_cred())
> > >               creds = override_creds(req->work.creds);
> > >
> > > +     if (req->opcode < IORING_OP_LAST)
> > > +             audit_uring_entry(req->opcode);
> 
> Note well the override_creds() call right above the audit code that is
> being added, it will be important later in this email.
> 
> > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > > index cc89e9f9a753..729849d41631 100644
> > > --- a/kernel/auditsc.c
> > > +++ b/kernel/auditsc.c
> > > @@ -1536,6 +1562,52 @@ static void audit_log_proctitle(void)
> > >       audit_log_end(ab);
> > >  }
> > >
> > > +/**
> > > + * audit_log_uring - generate a AUDIT_URINGOP record
> > > + * @ctx: the audit context
> > > + */
> > > +static void audit_log_uring(struct audit_context *ctx)
> > > +{
> > > +     struct audit_buffer *ab;
> > > +     const struct cred *cred;
> > > +
> > > +     /*
> > > +      * TODO: What do we log here?  I'm tossing in a few things to start the
> > > +      *       conversation, but additional thought needs to go into this.
> > > +      */
> > > +
> > > +     ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP);
> > > +     if (!ab)
> > > +             return;
> > > +     cred = current_cred();
> >
> > This may need to be req->work.creds.  I haven't been following if the
> > io_uring thread inherited the user task's creds (and below, comm and
> > exe).
> 
> Nope, we're good.  See the existing code in io_issue_sqe() :)
> 
> > > +     audit_log_format(ab, "uring_op=%d", ctx->uring_op);
> >
> > arch is stored below in __audit_uring_entry() and never used in the
> > AUDIT_CTX_URING case.  That assignment can either be dropped or printed
> > before uring_op similar to the SYSCALL record.
> 
> Good point, I'll drop the code that records the arch from _entry(); it
> is really only useful to give the appropriate context if needed for
> other things in the audit stream, and that isn't the case like it is
> with syscalls.
> 
> > There aren't really any arg[0-3] to print.
> 
> Which is why I didn't print them.
> 
> > io_uring_register and io_uring_setup() args are better covered by other
> > records.  io_uring_enter() has 6 args and the last two aren't covered by
> > SYSCALL anyways.
> 
>  ???
> 
> I think you are confusing the io_uring ops with syscalls; they are
> very different things from an audit perspective and the io_uring
> auditing is not intended to replace syscall records.  The
> io_uring_setup() and io_uring_enter() syscalls will be audited just
> as any other syscall would be, using the existing syscall audit code.
> 
> > > +     if (ctx->return_valid != AUDITSC_INVALID)
> > > +             audit_log_format(ab, " success=%s exit=%ld",
> > > +                              (ctx->return_valid == AUDITSC_SUCCESS ?
> > > +                               "yes" : "no"),
> > > +                              ctx->return_code);
> > > +     audit_log_format(ab,
> > > +                      " items=%d"
> > > +                      " ppid=%d pid=%d auid=%u uid=%u gid=%u"
> > > +                      " euid=%u suid=%u fsuid=%u"
> > > +                      " egid=%u sgid=%u fsgid=%u",
> > > +                      ctx->name_count,
> > > +                      task_ppid_nr(current),
> > > +                      task_tgid_nr(current),
> > > +                      from_kuid(&init_user_ns, audit_get_loginuid(current)),
> > > +                      from_kuid(&init_user_ns, cred->uid),
> > > +                      from_kgid(&init_user_ns, cred->gid),
> > > +                      from_kuid(&init_user_ns, cred->euid),
> > > +                      from_kuid(&init_user_ns, cred->suid),
> > > +                      from_kuid(&init_user_ns, cred->fsuid),
> > > +                      from_kgid(&init_user_ns, cred->egid),
> > > +                      from_kgid(&init_user_ns, cred->sgid),
> > > +                      from_kgid(&init_user_ns, cred->fsgid));
> >
> > The audit session ID is still important, relevant and qualifies auid.
> > In keeping with the SYSCALL record format, I think we want to keep
> > ses=audit_get_sessionid(current) in here.
> 
> This might be another case of syscall/io_uring confusion.  An io_uring
> op doesn't necessarily have an audit session ID or an audit UID in the
> conventional sense; for example, think about SQPOLL workers, shared
> rings, etc.

Right, but those syscalls are what instigate io_uring operations, so
whatever process starts that operation, or gets handed that handle
should be tracked with auid and sessionid (the two work together to
track) unless we can easily track io_uring ops to connect them to a
previous setup syscall.  If we see a need to keep the auid, then the
sessionid goes with it.

> > I'm pretty sure we also want to keep comm= and exe= too, but may have to
> > reach into req->task to get it.  There are two values for comm possible,
> > one from the original task and second "iou-sqp-<pid>" set at the top of
> > io_sq_thread().
> 
> I think this is more syscall/io_uring confusion.

I wouldn't call them confusion but rather parallels, attributing a
particular subject to an action.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore Aug. 25, 2021, 7:41 p.m. UTC | #40
On Tue, Aug 24, 2021 at 9:21 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>
> On 2021-06-02 13:46, Paul Moore wrote:
> > On Wed, Jun 2, 2021 at 1:29 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2021-05-21 17:49, Paul Moore wrote:
> > > > WARNING - This is a work in progress and should not be merged
> > > > anywhere important.  It is almost surely not complete, and while it
> > > > probably compiles it likely hasn't been booted and will do terrible
> > > > things.  You have been warned.
> > > >
> > > > This patch adds basic auditing to io_uring operations, regardless of
> > > > their context.  This is accomplished by allocating audit_context
> > > > structures for the io-wq worker and io_uring SQPOLL kernel threads
> > > > as well as explicitly auditing the io_uring operations in
> > > > io_issue_sqe().  The io_uring operations are audited using a new
> > > > AUDIT_URINGOP record, an example is shown below:
> > > >
> > > >   % <TODO - insert AUDIT_URINGOP record example>
> > > >
> > > > Thanks to Richard Guy Briggs for review and feedback.
> > > >
> > > > Signed-off-by: Paul Moore <paul@paul-moore.com>
> > > > ---
> > > >  fs/io-wq.c                 |    4 +
> > > >  fs/io_uring.c              |   11 +++
> > > >  include/linux/audit.h      |   17 ++++
> > > >  include/uapi/linux/audit.h |    1
> > > >  kernel/audit.h             |    2 +
> > > >  kernel/auditsc.c           |  173 ++++++++++++++++++++++++++++++++++++++++++++
> > > >  6 files changed, 208 insertions(+)

...

> > > > +     if (ctx->return_valid != AUDITSC_INVALID)
> > > > +             audit_log_format(ab, " success=%s exit=%ld",
> > > > +                              (ctx->return_valid == AUDITSC_SUCCESS ?
> > > > +                               "yes" : "no"),
> > > > +                              ctx->return_code);
> > > > +     audit_log_format(ab,
> > > > +                      " items=%d"
> > > > +                      " ppid=%d pid=%d auid=%u uid=%u gid=%u"
> > > > +                      " euid=%u suid=%u fsuid=%u"
> > > > +                      " egid=%u sgid=%u fsgid=%u",
> > > > +                      ctx->name_count,
> > > > +                      task_ppid_nr(current),
> > > > +                      task_tgid_nr(current),
> > > > +                      from_kuid(&init_user_ns, audit_get_loginuid(current)),
> > > > +                      from_kuid(&init_user_ns, cred->uid),
> > > > +                      from_kgid(&init_user_ns, cred->gid),
> > > > +                      from_kuid(&init_user_ns, cred->euid),
> > > > +                      from_kuid(&init_user_ns, cred->suid),
> > > > +                      from_kuid(&init_user_ns, cred->fsuid),
> > > > +                      from_kgid(&init_user_ns, cred->egid),
> > > > +                      from_kgid(&init_user_ns, cred->sgid),
> > > > +                      from_kgid(&init_user_ns, cred->fsgid));
> > >
> > > The audit session ID is still important, relevant and qualifies auid.
> > > In keeping with the SYSCALL record format, I think we want to keep
> > > ses=audit_get_sessionid(current) in here.
> >
> > This might be another case of syscall/io_uring confusion.  An io_uring
> > op doesn't necessarily have an audit session ID or an audit UID in the
> > conventional sense; for example, think about SQPOLL workers, shared
> > rings, etc.
>
> Right, but those syscalls are what instigate io_uring operations, so
> whatever process starts that operation, or gets handed that handle
> should be tracked with auid and sessionid (the two work together to
> track) unless we can easily track io_uring ops to connect them to a
> previous setup syscall.  If we see a need to keep the auid, then the
> sessionid goes with it.

As a reminder, once the io_uring is created appropriately one can
issue io_uring operations without making a syscall.  Further, sharing
an io_uring across process boundaries means that both the audit session
ID and audit login UID used to create the io_uring might not be the
same as the subject which issues operations to the io_uring.

Any io_uring operations that happen synchronously as the result of a
syscall should be associated with the SYSCALL record so the session ID
and login UID will be part of the event.  Asynchronous operations will
not have that information because we don't have a way to get it.
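
To make the "no syscall per operation" point concrete, here is a rough
userspace sketch using liburing with IORING_SETUP_SQPOLL; the file name
and idle timeout are arbitrary, and on kernels of this era SQPOLL may
require extra privileges.  Once the ring is set up, the SQPOLL kernel
thread consumes new SQEs on its own, so the openat below is typically
issued without a submission syscall from the task:

  #include <fcntl.h>
  #include <stdio.h>
  #include <liburing.h>

  int main(void)
  {
          struct io_uring ring;
          struct io_uring_params p = {
                  .flags = IORING_SETUP_SQPOLL,
                  .sq_thread_idle = 2000, /* ms before the SQ thread naps */
          };
          struct io_uring_sqe *sqe;
          struct io_uring_cqe *cqe;

          if (io_uring_queue_init_params(8, &ring, &p) < 0)
                  return 1;

          sqe = io_uring_get_sqe(&ring);
          if (!sqe)
                  goto out;
          io_uring_prep_openat(sqe, AT_FDCWD, "/etc/hostname", O_RDONLY, 0);

          /* with SQPOLL this normally just bumps the SQ tail; no syscall */
          io_uring_submit(&ring);

          if (io_uring_wait_cqe(&ring, &cqe) == 0) {
                  printf("openat result: %d\n", cqe->res);
                  io_uring_cqe_seen(&ring, cqe);
          }
  out:
          io_uring_queue_exit(&ring);
          return 0;
  }
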
diff mbox series

Patch

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 5361a9b4b47b..8af09a3336e0 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -16,6 +16,7 @@ 
 #include <linux/rculist_nulls.h>
 #include <linux/cpu.h>
 #include <linux/tracehook.h>
+#include <linux/audit.h>
 
 #include "io-wq.h"
 
@@ -535,6 +536,8 @@  static int io_wqe_worker(void *data)
 	snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid);
 	set_task_comm(current, buf);
 
+	audit_alloc_kernel(current);
+
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		long ret;
 
@@ -573,6 +576,7 @@  static int io_wqe_worker(void *data)
 			raw_spin_unlock_irq(&wqe->lock);
 	}
 
+	audit_free(current);
 	io_worker_exit(worker);
 	return 0;
 }
diff --git a/fs/io_uring.c b/fs/io_uring.c
index e481ac8a757a..e9941d1ad8fd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -78,6 +78,7 @@ 
 #include <linux/task_work.h>
 #include <linux/pagemap.h>
 #include <linux/io_uring.h>
+#include <linux/audit.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/io_uring.h>
@@ -6105,6 +6106,9 @@  static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	if (req->work.creds && req->work.creds != current_cred())
 		creds = override_creds(req->work.creds);
 
+	if (req->opcode < IORING_OP_LAST)
+		audit_uring_entry(req->opcode);
+
 	switch (req->opcode) {
 	case IORING_OP_NOP:
 		ret = io_nop(req, issue_flags);
@@ -6211,6 +6215,9 @@  static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 		break;
 	}
 
+	if (req->opcode < IORING_OP_LAST)
+		audit_uring_exit(!ret, ret);
+
 	if (creds)
 		revert_creds(creds);
 
@@ -6827,6 +6834,8 @@  static int io_sq_thread(void *data)
 		set_cpus_allowed_ptr(current, cpu_online_mask);
 	current->flags |= PF_NO_SETAFFINITY;
 
+	audit_alloc_kernel(current);
+
 	mutex_lock(&sqd->lock);
 	/* a user may had exited before the thread started */
 	io_run_task_work_head(&sqd->park_task_work);
@@ -6916,6 +6925,8 @@  static int io_sq_thread(void *data)
 	io_run_task_work_head(&sqd->park_task_work);
 	mutex_unlock(&sqd->lock);
 
+	audit_free(current);
+
 	complete(&sqd->exited);
 	do_exit(0);
 }
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 82b7c1116a85..6a0c013bc7de 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -286,7 +286,10 @@  static inline int audit_signal_info(int sig, struct task_struct *t)
 /* These are defined in auditsc.c */
 				/* Public API */
 extern int  audit_alloc(struct task_struct *task);
+extern int  audit_alloc_kernel(struct task_struct *task);
 extern void __audit_free(struct task_struct *task);
+extern void __audit_uring_entry(u8 op);
+extern void __audit_uring_exit(int success, long code);
 extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1,
 				  unsigned long a2, unsigned long a3);
 extern void __audit_syscall_exit(int ret_success, long ret_value);
@@ -323,6 +326,16 @@  static inline void audit_free(struct task_struct *task)
 	if (unlikely(task->audit_context))
 		__audit_free(task);
 }
+static inline void audit_uring_entry(u8 op)
+{
+	if (unlikely(audit_context()))
+		__audit_uring_entry(op);
+}
+static inline void audit_uring_exit(int success, long code)
+{
+	if (unlikely(audit_context()))
+		__audit_uring_exit(success, code);
+}
 static inline void audit_syscall_entry(int major, unsigned long a0,
 				       unsigned long a1, unsigned long a2,
 				       unsigned long a3)
@@ -554,6 +567,10 @@  static inline int audit_alloc(struct task_struct *task)
 {
 	return 0;
 }
+static inline int audit_alloc_kernel(struct task_struct *task)
+{
+	return 0;
+}
 static inline void audit_free(struct task_struct *task)
 { }
 static inline void audit_syscall_entry(int major, unsigned long a0,
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index cd2d8279a5e4..b26e0c435e8b 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -118,6 +118,7 @@ 
 #define AUDIT_TIME_ADJNTPVAL	1333	/* NTP value adjustment */
 #define AUDIT_BPF		1334	/* BPF subsystem */
 #define AUDIT_EVENT_LISTENER	1335	/* Task joined multicast read socket */
+#define AUDIT_URINGOP		1336	/* io_uring operation */
 
 #define AUDIT_AVC		1400	/* SE Linux avc denial or grant */
 #define AUDIT_SELINUX_ERR	1401	/* Internal SE Linux Errors */
diff --git a/kernel/audit.h b/kernel/audit.h
index fba180de5912..50de827497ca 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -100,10 +100,12 @@  struct audit_context {
 	enum {
 		AUDIT_CTX_UNUSED,	/* audit_context is currently unused */
 		AUDIT_CTX_SYSCALL,	/* in use by syscall */
+		AUDIT_CTX_URING,	/* in use by io_uring */
 	} context;
 	enum audit_state    state, current_state;
 	unsigned int	    serial;     /* serial number for record */
 	int		    major;      /* syscall number */
+	int		    uring_op;   /* uring operation */
 	struct timespec64   ctime;      /* time of syscall entry */
 	unsigned long	    argv[4];    /* syscall arguments */
 	long		    return_code;/* syscall return code */
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index cc89e9f9a753..729849d41631 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -953,6 +953,7 @@  static void audit_reset_context(struct audit_context *ctx)
 	ctx->current_state = ctx->state;
 	ctx->serial = 0;
 	ctx->major = 0;
+	ctx->uring_op = 0;
 	ctx->ctime = (struct timespec64){ .tv_sec = 0, .tv_nsec = 0 };
 	memset(ctx->argv, 0, sizeof(ctx->argv));
 	ctx->return_code = 0;
@@ -1038,6 +1039,31 @@  int audit_alloc(struct task_struct *tsk)
 	return 0;
 }
 
+/**
+ * audit_alloc_kernel - allocate an audit_context for a kernel task
+ * @tsk: the kernel task
+ *
+ * Similar to the audit_alloc() function, but intended for kernel private
+ * threads.  Returns zero on success, negative values on failure.
+ */
+int audit_alloc_kernel(struct task_struct *tsk)
+{
+	/*
+	 * At the moment we are just going to call into audit_alloc() to
+	 * simplify the code, but there two things to keep in mind with this
+	 * approach:
+	 *
+	 * 1. Filtering internal kernel tasks is a bit laughable in almost all
+	 * cases, but there is at least one case where there is a benefit:
+	 * the '-a task,never' case allows the admin to effectively disable
+	 * task auditing at runtime.
+	 *
+	 * 2. The {set,clear}_task_syscall_work() ops likely have zero effect
+	 * on these internal kernel tasks, but they probably don't hurt either.
+	 */
+	return audit_alloc(tsk);
+}
+
 static inline void audit_free_context(struct audit_context *context)
 {
 	/* resetting is extra work, but it is likely just noise */
@@ -1536,6 +1562,52 @@  static void audit_log_proctitle(void)
 	audit_log_end(ab);
 }
 
+/**
+ * audit_log_uring - generate a AUDIT_URINGOP record
+ * @ctx: the audit context
+ */
+static void audit_log_uring(struct audit_context *ctx)
+{
+	struct audit_buffer *ab;
+	const struct cred *cred;
+
+	/*
+	 * TODO: What do we log here?  I'm tossing in a few things to start the
+	 *       conversation, but additional thought needs to go into this.
+	 */
+
+	ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP);
+	if (!ab)
+		return;
+	cred = current_cred();
+	audit_log_format(ab, "uring_op=%d", ctx->uring_op);
+	if (ctx->return_valid != AUDITSC_INVALID)
+		audit_log_format(ab, " success=%s exit=%ld",
+				 (ctx->return_valid == AUDITSC_SUCCESS ?
+				  "yes" : "no"),
+				 ctx->return_code);
+	audit_log_format(ab,
+			 " items=%d"
+			 " ppid=%d pid=%d auid=%u uid=%u gid=%u"
+			 " euid=%u suid=%u fsuid=%u"
+			 " egid=%u sgid=%u fsgid=%u",
+			 ctx->name_count,
+			 task_ppid_nr(current),
+			 task_tgid_nr(current),
+			 from_kuid(&init_user_ns, audit_get_loginuid(current)),
+			 from_kuid(&init_user_ns, cred->uid),
+			 from_kgid(&init_user_ns, cred->gid),
+			 from_kuid(&init_user_ns, cred->euid),
+			 from_kuid(&init_user_ns, cred->suid),
+			 from_kuid(&init_user_ns, cred->fsuid),
+			 from_kgid(&init_user_ns, cred->egid),
+			 from_kgid(&init_user_ns, cred->sgid),
+			 from_kgid(&init_user_ns, cred->fsgid));
+	audit_log_task_context(ab);
+	audit_log_key(ab, ctx->filterkey);
+	audit_log_end(ab);
+}
+
 static void audit_log_exit(void)
 {
 	int i, call_panic = 0;
@@ -1571,6 +1643,9 @@  static void audit_log_exit(void)
 		audit_log_key(ab, context->filterkey);
 		audit_log_end(ab);
 		break;
+	case AUDIT_CTX_URING:
+		audit_log_uring(context);
+		break;
 	default:
 		BUG();
 		break;
@@ -1740,6 +1815,104 @@  static void audit_return_fixup(struct audit_context *ctx,
 	ctx->return_valid = (success ? AUDITSC_SUCCESS : AUDITSC_FAILURE);
 }
 
+/**
+ * __audit_uring_entry - prepare the kernel task's audit context for io_uring
+ * @op: the io_uring opcode
+ *
+ * This is similar to audit_syscall_entry() but is intended for use by io_uring
+ * operations.
+ */
+void __audit_uring_entry(u8 op)
+{
+	struct audit_context *ctx = audit_context();
+
+	if (!audit_enabled || !ctx || ctx->state == AUDIT_DISABLED)
+		return;
+
+	/*
+	 * NOTE: It's possible that we can be called from the process' context
+	 *       before it returns to userspace, and before audit_syscall_exit()
+	 *       is called.  In this case there is not much to do, just record
+	 *       the io_uring details and return.
+	 */
+	ctx->uring_op = op;
+	if (ctx->context == AUDIT_CTX_SYSCALL)
+		return;
+
+	ctx->dummy = !audit_n_rules;
+	if (!ctx->dummy && ctx->state == AUDIT_BUILD_CONTEXT)
+		ctx->prio = 0;
+
+	ctx->arch = syscall_get_arch(current);
+	ctx->context = AUDIT_CTX_URING;
+	ctx->current_state = ctx->state;
+	ktime_get_coarse_real_ts64(&ctx->ctime);
+}
+
+/**
+ * __audit_uring_exit - wrap up the kernel task's audit context after io_uring
+ * @success: true/false value to indicate if the operation succeeded or not
+ * @code: operation return code
+ *
+ * This is similar to audit_syscall_exit() but is intended for use by io_uring
+ * operations.
+ */
+void __audit_uring_exit(int success, long code)
+{
+	struct audit_context *ctx = audit_context();
+
+	/*
+	 * TODO: At some point we will likely want to filter on io_uring ops
+	 *       and other things similar to what we do for syscalls, but that
+	 *       is something for another day; just record what we can here.
+	 */
+
+	if (!ctx || ctx->dummy)
+		goto out;
+	if (ctx->context == AUDIT_CTX_SYSCALL) {
+		/*
+		 * NOTE: See the note in __audit_uring_entry() about the case
+		 *       where we may be called from process context before we
+		 *       return to userspace via audit_syscall_exit().  In this
+		 *       case we simply emit a URINGOP record and bail, the
+		 *       normal syscall exit handling will take care of
+		 *       everything else.
+		 *       It is also worth mentioning that when we are called,
+		 *       the current process creds may differ from the creds
+		 *       used during the normal syscall processing; keep that
+		 *       in mind if/when we move the record generation code.
+		 */
+
+		/*
+		 * We need to filter on the syscall info here to decide if we
+		 * should emit a URINGOP record.  I know it seems odd but this
+		 * solves the problem where users have a filter to block *all*
+		 * syscall records in the "exit" filter; we want to preserve
+		 * the behavior here.
+		 */
+		audit_filter_syscall(current, ctx);
+		audit_filter_inodes(current, ctx);
+		if (ctx->current_state != AUDIT_RECORD_CONTEXT)
+			return;
+
+		audit_log_uring(ctx);
+		return;
+	}
+
+	/* this may generate CONFIG_CHANGE records */
+	if (!list_empty(&ctx->killed_trees))
+		audit_kill_trees(ctx);
+
+	audit_filter_inodes(current, ctx);
+	if (ctx->current_state != AUDIT_RECORD_CONTEXT)
+		goto out;
+	audit_return_fixup(ctx, success, code);
+	audit_log_exit();
+
+out:
+	audit_reset_context(ctx);
+}
+
 /**
  * __audit_syscall_entry - fill in an audit record at syscall entry
  * @major: major syscall type (function)