Message ID: 162163379461.8379.9691291608621179559.stgit@sifl (mailing list archive)
State: New, archived
Series: Add LSM access controls and auditing to io_uring
On 5/21/21 10:49 PM, Paul Moore wrote: > WARNING - This is a work in progress and should not be merged > anywhere important. It is almost surely not complete, and while it > probably compiles it likely hasn't been booted and will do terrible > things. You have been warned. > > This patch adds basic auditing to io_uring operations, regardless of > their context. This is accomplished by allocating audit_context > structures for the io-wq worker and io_uring SQPOLL kernel threads > as well as explicitly auditing the io_uring operations in > io_issue_sqe(). The io_uring operations are audited using a new > AUDIT_URINGOP record, an example is shown below: > > % <TODO - insert AUDIT_URINGOP record example> > > Thanks to Richard Guy Briggs for review and feedback. > > Signed-off-by: Paul Moore <paul@paul-moore.com> > --- [...] > diff --git a/fs/io_uring.c b/fs/io_uring.c > index e481ac8a757a..e9941d1ad8fd 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -78,6 +78,7 @@ > #include <linux/task_work.h> > #include <linux/pagemap.h> > #include <linux/io_uring.h> > +#include <linux/audit.h> > > #define CREATE_TRACE_POINTS > #include <trace/events/io_uring.h> > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) > if (req->work.creds && req->work.creds != current_cred()) > creds = override_creds(req->work.creds); > > + if (req->opcode < IORING_OP_LAST) always true at this point > + audit_uring_entry(req->opcode); So, it adds two if's with memory loads (i.e. current->audit_context) per request in one of the hottest functions here... No way, nack Maybe, if it's dynamically compiled into like kprobes if it's _really_ used. > + > switch (req->opcode) { > case IORING_OP_NOP: > ret = io_nop(req, issue_flags); > @@ -6211,6 +6215,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) > break; > } > > + if (req->opcode < IORING_OP_LAST) > + audit_uring_exit(!ret, ret); > + > if (creds) > revert_creds(creds);
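A concrete reading of Pavel's "dynamically compiled in, like kprobes" remark is the kernel's static-key (jump label) machinery, where the branch is patched out of the hot path until io_uring auditing is actually wanted. A minimal, illustrative sketch -- the key name, and the idea of flipping it when an io_uring-relevant audit rule is loaded, are assumptions, not anything from the posted series:

#include <linux/jump_label.h>
#include <linux/audit.h>

/* Hypothetical key, default off: the branch below costs a NOP in the
 * fast path until something calls static_branch_enable(&io_uring_audit_key),
 * e.g. when the first matching audit rule is inserted. */
DEFINE_STATIC_KEY_FALSE(io_uring_audit_key);

static inline void io_audit_entry(struct io_kiocb *req)
{
	if (static_branch_unlikely(&io_uring_audit_key))
		audit_uring_entry(req->opcode);
}

Whether the audit rule engine could reliably know when to flip such a key is exactly the sticking point Paul raises below.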
On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <asml.silence@gmail.com> wrote: > On 5/21/21 10:49 PM, Paul Moore wrote: > > WARNING - This is a work in progress and should not be merged > > anywhere important. It is almost surely not complete, and while it > > probably compiles it likely hasn't been booted and will do terrible > > things. You have been warned. > > > > This patch adds basic auditing to io_uring operations, regardless of > > their context. This is accomplished by allocating audit_context > > structures for the io-wq worker and io_uring SQPOLL kernel threads > > as well as explicitly auditing the io_uring operations in > > io_issue_sqe(). The io_uring operations are audited using a new > > AUDIT_URINGOP record, an example is shown below: > > > > % <TODO - insert AUDIT_URINGOP record example> > > > > Thanks to Richard Guy Briggs for review and feedback. > > > > Signed-off-by: Paul Moore <paul@paul-moore.com> > > --- > [...] > > diff --git a/fs/io_uring.c b/fs/io_uring.c > > index e481ac8a757a..e9941d1ad8fd 100644 > > --- a/fs/io_uring.c > > +++ b/fs/io_uring.c > > @@ -78,6 +78,7 @@ > > #include <linux/task_work.h> > > #include <linux/pagemap.h> > > #include <linux/io_uring.h> > > +#include <linux/audit.h> > > > > #define CREATE_TRACE_POINTS > > #include <trace/events/io_uring.h> > > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) > > if (req->work.creds && req->work.creds != current_cred()) > > creds = override_creds(req->work.creds); > > > > + if (req->opcode < IORING_OP_LAST) > > always true at this point I placed the opcode check before the audit call because the switch statement below which handles the operation dispatching has a 'ret = -EINVAL' for the default case, implying that there are some paths where an invalid opcode could be passed into the function. Obviously if that is not the case and you can guarantee that req->opcode will always be valid we can easily drop the check prior to the audit call. > > + audit_uring_entry(req->opcode); > > So, it adds two if's with memory loads (i.e. current->audit_context) > per request in one of the hottest functions here... No way, nack > > Maybe, if it's dynamically compiled into like kprobes if it's > _really_ used. I'm open to suggestions on how to tweak the io_uring/audit integration, if you don't like what I've proposed in this patchset, lets try to come up with a solution that is more palatable. If you were going to add audit support for these io_uring operations, how would you propose we do it? Not being able to properly audit io_uring operations is going to be a significant issue for a chunk of users, if it isn't already, we need to work to find a solution to this problem. Unfortunately I don't think dynamically inserting audit calls is something that would meet the needs of the audit community (I fear it would run afoul of the various security certifications), and it definitely isn't something that we support at present.
On 5/22/21 3:36 AM, Paul Moore wrote: > On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <asml.silence@gmail.com> wrote: >> On 5/21/21 10:49 PM, Paul Moore wrote: [...] >>> >>> + if (req->opcode < IORING_OP_LAST) >> >> always true at this point > > I placed the opcode check before the audit call because the switch > statement below which handles the operation dispatching has a 'ret = > -EINVAL' for the default case, implying that there are some paths > where an invalid opcode could be passed into the function. Obviously > if that is not the case and you can guarantee that req->opcode will > always be valid we can easily drop the check prior to the audit call. It is always true at this point, would be completely broken otherwise >>> + audit_uring_entry(req->opcode); >> >> So, it adds two if's with memory loads (i.e. current->audit_context) >> per request in one of the hottest functions here... No way, nack >> >> Maybe, if it's dynamically compiled into like kprobes if it's >> _really_ used. > > I'm open to suggestions on how to tweak the io_uring/audit > integration, if you don't like what I've proposed in this patchset, > lets try to come up with a solution that is more palatable. If you > were going to add audit support for these io_uring operations, how > would you propose we do it? Not being able to properly audit io_uring > operations is going to be a significant issue for a chunk of users, if > it isn't already, we need to work to find a solution to this problem. Who knows. First of all, seems CONFIG_AUDIT is enabled by default for many popular distributions, so I assume that is not compiled out. What are use cases for audit? Always running I guess? Putting aside compatibility problems, it sounds that with the amount of overhead it adds there is no much profit in using io_uring in the first place. Is that so? __audit_uring_exit() -> audit_filter_syscall() -> for (audit_list) if (...) audit_filter_rules() -> ... -> audit_filter_inodes() -> ... > Unfortunately I don't think dynamically inserting audit calls is > something that would meet the needs of the audit community (I fear it > would run afoul of the various security certifications), and it > definitely isn't something that we support at present. I see
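For context on the overhead Pavel is pointing at: once rules are loaded, every audited event walks the exit-filter rule list. Heavily condensed and paraphrased from kernel/auditsc.c of that era (early-exit checks and some parameters elided, not a verbatim copy):

static void audit_filter_syscall(struct task_struct *tsk,
                                 struct audit_context *ctx)
{
        struct audit_entry *e;
        enum audit_state state;

        rcu_read_lock();
        /* one pass over the loaded AUDIT_FILTER_EXIT rules per event */
        list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], list) {
                if (audit_in_mask(&e->rule, ctx->major) &&
                    audit_filter_rules(tsk, &e->rule, ctx, NULL, &state, false)) {
                        ctx->current_state = state;
                        break;
                }
        }
        rcu_read_unlock();
}

So the per-request cost scales with the loaded rule set; the rest of the thread is mostly about what it costs when no rules (or only a "task,never" rule) are loaded.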
On Sun, May 23, 2021 at 4:26 PM Pavel Begunkov <asml.silence@gmail.com> wrote: > On 5/22/21 3:36 AM, Paul Moore wrote: > > On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <asml.silence@gmail.com> wrote: > >> On 5/21/21 10:49 PM, Paul Moore wrote: > [...] > >>> > >>> + if (req->opcode < IORING_OP_LAST) > >> > >> always true at this point > > > > I placed the opcode check before the audit call because the switch > > statement below which handles the operation dispatching has a 'ret = > > -EINVAL' for the default case, implying that there are some paths > > where an invalid opcode could be passed into the function. Obviously > > if that is not the case and you can guarantee that req->opcode will > > always be valid we can easily drop the check prior to the audit call. > > It is always true at this point, would be completely broken > otherwise Understood, I was just pointing out an oddity in the code. I just dropped the checks from my local tree, you'll see it in the next draft of the patchset. > >> So, it adds two if's with memory loads (i.e. current->audit_context) > >> per request in one of the hottest functions here... No way, nack > >> > >> Maybe, if it's dynamically compiled into like kprobes if it's > >> _really_ used. > > > > I'm open to suggestions on how to tweak the io_uring/audit > > integration, if you don't like what I've proposed in this patchset, > > lets try to come up with a solution that is more palatable. If you > > were going to add audit support for these io_uring operations, how > > would you propose we do it? Not being able to properly audit io_uring > > operations is going to be a significant issue for a chunk of users, if > > it isn't already, we need to work to find a solution to this problem. > > Who knows. First of all, seems CONFIG_AUDIT is enabled by default > for many popular distributions, so I assume that is not compiled out. > > What are use cases for audit? Always running I guess? Audit has been around for quite some time now, and it's goal is to provide a mechanism for logging "security relevant" information in such a way that it meets the needs of various security certification efforts. Traditional Linux event logging, e.g. syslog and the like, does not meet these requirements and changing them would likely affect the usability for those who are not required to be compliant with these security certifications. The Linux audit subsystem allows Linux to be used in places it couldn't be used otherwise (or rather makes it a *lot* easier). That said, audit is not for everyone, and we have build time and runtime options to help make life easier. Beyond simply disabling audit at compile time a number of Linux distributions effectively shortcut audit at runtime by adding a "never" rule to the audit filter, for example: % auditctl -a task,never > Putting aside compatibility problems, it sounds that with the amount of overhead > it adds there is no much profit in using io_uring in the first place. > Is that so? Well, if audit alone erased all of the io_uring advantages we should just rip io_uring out of the kernel and people can just disable audit instead ;) I believe there are people who would like to use io_uring and are also required to use a kernel with audit, either due to the need to run a distribution kernel or the need to capture security information in the audit stream. I'm hoping that we can find a solution for these users; if we don't we are asking this group to choose either io_uring or audit, and that is something I would like to avoid.
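The reason a "task,never" rule is nearly free at request time is that it is evaluated once per task, at fork, rather than per operation. Roughly, paraphrasing kernel/auditsc.c (not a verbatim copy; some housekeeping around the syscall-audit task flag is elided):

int audit_alloc(struct task_struct *tsk)
{
        struct audit_context *context;
        enum audit_state state;
        char *key = NULL;

        if (likely(!audit_ever_enabled))
                return 0;

        /* a "task,never" rule matches here and yields AUDIT_DISABLED... */
        state = audit_filter_task(tsk, &key);
        if (state == AUDIT_DISABLED) {
                /* ...so the task never gets an audit_context at all */
                return 0;
        }

        context = audit_alloc_context(state);
        if (!context) {
                kfree(key);
                audit_log_lost("out of memory in audit_alloc");
                return -ENOMEM;
        }
        context->filterkey = key;
        tsk->audit_context = context;
        return 0;
}

Everything that follows in the thread hinges on how cheap the resulting current->audit_context == NULL test can be made in io_issue_sqe().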
On 5/24/21 8:59 PM, Paul Moore wrote: > On Sun, May 23, 2021 at 4:26 PM Pavel Begunkov <asml.silence@gmail.com> wrote: >> On 5/22/21 3:36 AM, Paul Moore wrote: >>> On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <asml.silence@gmail.com> wrote: >>>> On 5/21/21 10:49 PM, Paul Moore wrote: >> [...] >>>>> >>>>> + if (req->opcode < IORING_OP_LAST) >>>> >>>> always true at this point >>> >>> I placed the opcode check before the audit call because the switch >>> statement below which handles the operation dispatching has a 'ret = >>> -EINVAL' for the default case, implying that there are some paths >>> where an invalid opcode could be passed into the function. Obviously >>> if that is not the case and you can guarantee that req->opcode will >>> always be valid we can easily drop the check prior to the audit call. >> >> It is always true at this point, would be completely broken >> otherwise > > Understood, I was just pointing out an oddity in the code. I just > dropped the checks from my local tree, you'll see it in the next draft > of the patchset. > >>>> So, it adds two if's with memory loads (i.e. current->audit_context) >>>> per request in one of the hottest functions here... No way, nack >>>> >>>> Maybe, if it's dynamically compiled into like kprobes if it's >>>> _really_ used. >>> >>> I'm open to suggestions on how to tweak the io_uring/audit >>> integration, if you don't like what I've proposed in this patchset, >>> lets try to come up with a solution that is more palatable. If you >>> were going to add audit support for these io_uring operations, how >>> would you propose we do it? Not being able to properly audit io_uring >>> operations is going to be a significant issue for a chunk of users, if >>> it isn't already, we need to work to find a solution to this problem. >> >> Who knows. First of all, seems CONFIG_AUDIT is enabled by default >> for many popular distributions, so I assume that is not compiled out. >> >> What are use cases for audit? Always running I guess? > > Audit has been around for quite some time now, and it's goal is to > provide a mechanism for logging "security relevant" information in > such a way that it meets the needs of various security certification > efforts. Traditional Linux event logging, e.g. syslog and the like, > does not meet these requirements and changing them would likely affect > the usability for those who are not required to be compliant with > these security certifications. The Linux audit subsystem allows Linux > to be used in places it couldn't be used otherwise (or rather makes it > a *lot* easier). > > That said, audit is not for everyone, and we have build time and > runtime options to help make life easier. Beyond simply disabling > audit at compile time a number of Linux distributions effectively > shortcut audit at runtime by adding a "never" rule to the audit > filter, for example: > > % auditctl -a task,never > >> Putting aside compatibility problems, it sounds that with the amount of overhead >> it adds there is no much profit in using io_uring in the first place. >> Is that so? > > Well, if audit alone erased all of the io_uring advantages we should > just rip io_uring out of the kernel and people can just disable audit > instead ;) Hah, if we add a simple idle "feature" like for (i=0;i<1000000;i+) {;} and it would destroy all the performance, let's throw useless Linux kernel then! If seriously, it's more of an open question to me, how much overhead it adds if enabled (with typical filters/options/etc). 
Btw, do you really need two hooks -- before and right after execution? > I believe there are people who would like to use io_uring and are also > required to use a kernel with audit, either due to the need to run a > distribution kernel or the need to capture security information in the > audit stream. I'm hoping that we can find a solution for these users; > if we don't we are asking this group to choose either io_uring or > audit, and that is something I would like to avoid.
On Tue, May 25, 2021 at 4:27 AM Pavel Begunkov <asml.silence@gmail.com> wrote: > On 5/24/21 8:59 PM, Paul Moore wrote: > > On Sun, May 23, 2021 at 4:26 PM Pavel Begunkov <asml.silence@gmail.com> wrote: > >> On 5/22/21 3:36 AM, Paul Moore wrote: > >>> On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <asml.silence@gmail.com> wrote: > >>>> On 5/21/21 10:49 PM, Paul Moore wrote: > >> [...] > >>>>> > >>>>> + if (req->opcode < IORING_OP_LAST) > >>>> > >>>> always true at this point > >>> > >>> I placed the opcode check before the audit call because the switch > >>> statement below which handles the operation dispatching has a 'ret = > >>> -EINVAL' for the default case, implying that there are some paths > >>> where an invalid opcode could be passed into the function. Obviously > >>> if that is not the case and you can guarantee that req->opcode will > >>> always be valid we can easily drop the check prior to the audit call. > >> > >> It is always true at this point, would be completely broken > >> otherwise > > > > Understood, I was just pointing out an oddity in the code. I just > > dropped the checks from my local tree, you'll see it in the next draft > > of the patchset. > > > >>>> So, it adds two if's with memory loads (i.e. current->audit_context) > >>>> per request in one of the hottest functions here... No way, nack > >>>> > >>>> Maybe, if it's dynamically compiled into like kprobes if it's > >>>> _really_ used. > >>> > >>> I'm open to suggestions on how to tweak the io_uring/audit > >>> integration, if you don't like what I've proposed in this patchset, > >>> lets try to come up with a solution that is more palatable. If you > >>> were going to add audit support for these io_uring operations, how > >>> would you propose we do it? Not being able to properly audit io_uring > >>> operations is going to be a significant issue for a chunk of users, if > >>> it isn't already, we need to work to find a solution to this problem. > >> > >> Who knows. First of all, seems CONFIG_AUDIT is enabled by default > >> for many popular distributions, so I assume that is not compiled out. > >> > >> What are use cases for audit? Always running I guess? > > > > Audit has been around for quite some time now, and it's goal is to > > provide a mechanism for logging "security relevant" information in > > such a way that it meets the needs of various security certification > > efforts. Traditional Linux event logging, e.g. syslog and the like, > > does not meet these requirements and changing them would likely affect > > the usability for those who are not required to be compliant with > > these security certifications. The Linux audit subsystem allows Linux > > to be used in places it couldn't be used otherwise (or rather makes it > > a *lot* easier). > > > > That said, audit is not for everyone, and we have build time and > > runtime options to help make life easier. Beyond simply disabling > > audit at compile time a number of Linux distributions effectively > > shortcut audit at runtime by adding a "never" rule to the audit > > filter, for example: > > > > % auditctl -a task,never > > > >> Putting aside compatibility problems, it sounds that with the amount of overhead > >> it adds there is no much profit in using io_uring in the first place. > >> Is that so? 
> > > > Well, if audit alone erased all of the io_uring advantages we should > > just rip io_uring out of the kernel and people can just disable audit > > instead ;) > > Hah, if we add a simple idle "feature" like > > for (i=0;i<1000000;i+) {;} > > and it would destroy all the performance, let's throw useless > Linux kernel then! > > If seriously, it's more of an open question to me, how much overhead > it adds if enabled (with typical filters/options/etc). I am very hesitant to define a "typical" audit configuration as it really is dependent on the user's security requirements as well as the particular solution/environment. Any configuration I pick will be "wrong" for a lot of audit users :) As I see it, users will likely find themselves in one of three performance buckets: * io_uring enabled, CONFIG_AUDIT=n For those who are already trying to get that last 1% of performance from their kernel and are willing to give up most everything else to get it this is the obvious choice. Needless to say there should not be any audit related impact in this case (especially since we've removed that req->opcode checks prior to the audit calls). * io_uring enabled, CONFIG_AUDIT=y, audit "task,never" runtime config [side note: I made some tweaks to audit_uring_entry() to move the audit_enabled check there so we no longer need to call into __audit_uring_entry() in this fastpath case. Similarly audit_uring_exit() now does an audit_dummy_context() check instead of simply checking audit_context(), this should avoid calling into __audit_uring_exit() when the io_uring op is not being audited.] I'm guessing that most distro users will fall into this bucket. Here the task's audit_context should always be NULL, both in the syscall context and sqpoll context, so io_uring would call into the inline audit_uring_entry() function and return without calling into __audit_uring_entry() (see the "side note" above). The audit_uring_exit() call would behave in a similar manner. * io_uring enabled, CONFIG_AUDIT=y, some sort of runtime audit config For obvious reasons this case has the most performance impact and as mentioned above it can vary quite a bit. In my opinion this is the least important bucket from a performance perspective as the user clearly has a need for the security information that audit provides and enabling that functionality in io_uring is critical to allowing the user to take advantage of io_uring. The performance of io_uring is impacted, but it should still be more performant than synchronous I/O and it allows the user to run their existing io_uring applications/code-paths. > Btw, do you really need two hooks -- before and right after > execution? Yes, the entry/before hook does the setup for the io_uring op (very similar to a syscall entry point) and the exit/after hook is what does the audit record generation based on what took place during the io_uring operation. > > I believe there are people who would like to use io_uring and are also > > required to use a kernel with audit, either due to the need to run a > > distribution kernel or the need to capture security information in the > > audit stream. I'm hoping that we can find a solution for these users; > > if we don't we are asking this group to choose either io_uring or > > audit, and that is something I would like to avoid. I'm leaving this old paragraph here because I really think this is important to the discussion. 
If we can't find a solution here we would need to make io_uring and audit mutually exclusive and I don't think that is in the best interests of the users, and would surely create a headache for the distros.
On 5/24/21 1:59 PM, Paul Moore wrote: > That said, audit is not for everyone, and we have build time and > runtime options to help make life easier. Beyond simply disabling > audit at compile time a number of Linux distributions effectively > shortcut audit at runtime by adding a "never" rule to the audit > filter, for example: > > % auditctl -a task,never

As has been brought up, the issue we're facing is that distros have CONFIG_AUDIT=y and hence the above is the best real world case outside of people doing custom kernels. My question would then be how much overhead the above will add, considering it's an entry/exit call per op. If auditing is turned off via auditctl, what is the expectation in terms of overhead? My gut feeling tells me it's likely going to be too much. Keep in mind that we're sometimes doing millions of operations per second, per core.

aio never had any audit logging as far as I can tell. I think it'd make a lot more sense to selectively enable audit logging only for opcodes that we care about. File open/create/unlink/mkdir etc, that kind of thing. File level operations that people would care about logging. Would they care about logging a buffer registration or a polled read from a device/file? I highly doubt it, and we don't do that for alternative methods either. Doesn't really make sense for a lot of the other operations, imho.
On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: > On 5/24/21 1:59 PM, Paul Moore wrote: > > That said, audit is not for everyone, and we have build time and > > runtime options to help make life easier. Beyond simply disabling > > audit at compile time a number of Linux distributions effectively > > shortcut audit at runtime by adding a "never" rule to the audit > > filter, for example: > > > > % auditctl -a task,never > > As has been brought up, the issue we're facing is that distros have > CONFIG_AUDIT=y and hence the above is the best real world case outside > of people doing custom kernels. My question would then be how much > overhead the above will add, considering it's an entry/exit call per op. > If auditctl is turned off, what is the expectation in turns of overhead? I commented on that case in my last email to Pavel, but I'll try to go over it again in a little more detail. As we discussed earlier in this thread, we can skip the req->opcode check before both the _entry and _exit calls, so we are left with just the bare audit calls in the io_uring code. As the _entry and _exit functions are small, I've copied them and their supporting functions below and I'll try to explain what would happen in CONFIG_AUDIT=y, "task,never" case. + static inline struct audit_context *audit_context(void) + { + return current->audit_context; + } + static inline bool audit_dummy_context(void) + { + void *p = audit_context(); + return !p || *(int *)p; + } + static inline void audit_uring_entry(u8 op) + { + if (unlikely(audit_enabled && audit_context())) + __audit_uring_entry(op); + } We have one if statement where the conditional checks on two individual conditions. The first (audit_enabled) is simply a check to see if anyone has "turned on" auditing at runtime; historically this worked rather well, and still does in a number of places, but ever since systemd has taken to forcing audit on regardless of the admin's audit configuration it is less useful. The second (audit_context()) is a check to see if an audit_context has been allocated for the current task. In the case of "task,never" current->audit_context will be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath will never be called. Worst case here is checking the value of audit_enabled and current->audit_context. Depending on which you think is more likely we can change the order of the check so that the current->audit_context check is first if you feel that is more likely to be NULL than audit_enabled is to be false (it may be that way now). + static inline void audit_uring_exit(int success, long code) + { + if (unlikely(!audit_dummy_context())) + __audit_uring_exit(success, code); + } The exit call is very similar to the entry call, but in the "task,never" case it is very simple as the first check to be performed is the current->audit_context check which we know to be NULL. The __audit_uring_exit() slowpath will never be called. > aio never had any audit logging as far as I can tell. I think it'd make > a lot more sense to selectively enable audit logging only for opcodes > that we care about. File open/create/unlink/mkdir etc, that kind of > thing. File level operations that people would care about logging. Would > they care about logging a buffer registration or a polled read from a > device/file? I highly doubt it, and we don't do that for alternative > methods either. Doesn't really make sense for a lot of the other > operations, imho. 
We would need to check with the current security requirements (there are distro people on the linux-audit list that keep track of that stuff), but looking at the opcodes right now my gut feeling is that most of the opcodes would be considered "security relevant" so selective auditing might not be that useful in practice. It would definitely clutter the code and increase the chances that new opcodes would not be properly audited when they are merged.
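On the check-ordering question above, the swap Paul describes is mechanical; purely illustrative, not from the posted series:

static inline void audit_uring_entry(u8 op)
{
	/*
	 * If a NULL current->audit_context is the more common case on
	 * distro kernels (systemd tends to force audit_enabled on), test
	 * it first so the common path is one load plus one branch.
	 */
	if (unlikely(audit_context() && audit_enabled))
		__audit_uring_entry(op);
}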
On 5/26/21 3:04 AM, Paul Moore wrote: > On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: >> On 5/24/21 1:59 PM, Paul Moore wrote: >>> That said, audit is not for everyone, and we have build time and >>> runtime options to help make life easier. Beyond simply disabling >>> audit at compile time a number of Linux distributions effectively >>> shortcut audit at runtime by adding a "never" rule to the audit >>> filter, for example: >>> >>> % auditctl -a task,never >> >> As has been brought up, the issue we're facing is that distros have >> CONFIG_AUDIT=y and hence the above is the best real world case outside >> of people doing custom kernels. My question would then be how much >> overhead the above will add, considering it's an entry/exit call per op. >> If auditctl is turned off, what is the expectation in turns of overhead? > > I commented on that case in my last email to Pavel, but I'll try to go > over it again in a little more detail. > > As we discussed earlier in this thread, we can skip the req->opcode > check before both the _entry and _exit calls, so we are left with just > the bare audit calls in the io_uring code. As the _entry and _exit > functions are small, I've copied them and their supporting functions > below and I'll try to explain what would happen in CONFIG_AUDIT=y, > "task,never" case. > > + static inline struct audit_context *audit_context(void) > + { > + return current->audit_context; > + } > > + static inline bool audit_dummy_context(void) > + { > + void *p = audit_context(); > + return !p || *(int *)p; > + } > > + static inline void audit_uring_entry(u8 op) > + { > + if (unlikely(audit_enabled && audit_context())) > + __audit_uring_entry(op); > + } I'd rather agree that it's my cycle-picking. The case I care about is CONFIG_AUDIT=y (because everybody enable it), and io_uring tracing _not_ enabled at runtime. If enabled let them suffer the overhead, it will probably dip down the performance So, for the case I care about it's two of if (unlikely(audit_enabled && current->audit_context)) in the hot path. load-test-jump + current, so it will be around 7x2 instructions. We can throw away audit_enabled as you say systemd already enables it, that will give 4x2 instructions including 2 conditional jumps. That's not great at all. And that's why I brought up the question about need of pre and post hooks and whether can be combined. Would be just 4 instructions and that is ok (ish). > We would need to check with the current security requirements (there > are distro people on the linux-audit list that keep track of that > stuff), but looking at the opcodes right now my gut feeling is that > most of the opcodes would be considered "security relevant" so > selective auditing might not be that useful in practice. It would > definitely clutter the code and increase the chances that new opcodes > would not be properly audited when they are merged. I'm curious, why it's enabled by many distros by default? Are there use cases they use? Tempting to add AUDIT_IOURING=default N, but won't work I guess
On Wed, May 26, 2021 at 6:19 AM Pavel Begunkov <asml.silence@gmail.com> wrote: > On 5/26/21 3:04 AM, Paul Moore wrote: > > On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: > >> On 5/24/21 1:59 PM, Paul Moore wrote: > >>> That said, audit is not for everyone, and we have build time and > >>> runtime options to help make life easier. Beyond simply disabling > >>> audit at compile time a number of Linux distributions effectively > >>> shortcut audit at runtime by adding a "never" rule to the audit > >>> filter, for example: > >>> > >>> % auditctl -a task,never > >> > >> As has been brought up, the issue we're facing is that distros have > >> CONFIG_AUDIT=y and hence the above is the best real world case outside > >> of people doing custom kernels. My question would then be how much > >> overhead the above will add, considering it's an entry/exit call per op. > >> If auditctl is turned off, what is the expectation in turns of overhead? > > > > I commented on that case in my last email to Pavel, but I'll try to go > > over it again in a little more detail. > > > > As we discussed earlier in this thread, we can skip the req->opcode > > check before both the _entry and _exit calls, so we are left with just > > the bare audit calls in the io_uring code. As the _entry and _exit > > functions are small, I've copied them and their supporting functions > > below and I'll try to explain what would happen in CONFIG_AUDIT=y, > > "task,never" case. > > > > + static inline struct audit_context *audit_context(void) > > + { > > + return current->audit_context; > > + } > > > > + static inline bool audit_dummy_context(void) > > + { > > + void *p = audit_context(); > > + return !p || *(int *)p; > > + } > > > > + static inline void audit_uring_entry(u8 op) > > + { > > + if (unlikely(audit_enabled && audit_context())) > > + __audit_uring_entry(op); > > + } > > I'd rather agree that it's my cycle-picking. The case I care about > is CONFIG_AUDIT=y (because everybody enable it), and io_uring > tracing _not_ enabled at runtime. If enabled let them suffer > the overhead, it will probably dip down the performance > > So, for the case I care about it's two of > > if (unlikely(audit_enabled && current->audit_context)) > > in the hot path. load-test-jump + current, so it will > be around 7x2 instructions. We can throw away audit_enabled > as you say systemd already enables it, that will give > 4x2 instructions including 2 conditional jumps. We've basically got it down to the equivalent of two "current->audit_context != NULL" checks in the case where audit is built into the kernel but disabled at runtime, e.g. CONFIG_AUDIT=y and "task,never". I'm at a loss for how we can lower the overhead any further, but I'm open to suggestions. > That's not great at all. And that's why I brought up > the question about need of pre and post hooks and whether > can be combined. Would be just 4 instructions and that is > ok (ish). As discussed previously in this thread that isn't really an option from an audit perspective. > > We would need to check with the current security requirements (there > > are distro people on the linux-audit list that keep track of that > > stuff), but looking at the opcodes right now my gut feeling is that > > most of the opcodes would be considered "security relevant" so > > selective auditing might not be that useful in practice. It would > > definitely clutter the code and increase the chances that new opcodes > > would not be properly audited when they are merged. 
> > I'm curious, why it's enabled by many distros by default? Are there > use cases they use? We've already talked about certain users and environments where audit is an important requirement, e.g. public sector, health care, financial institutions, etc.; without audit Linux wouldn't be an option for these users, at least not without heavy modification, out-of-tree/ISV patches, etc. I currently don't have any direct ties to any distros, "Enterprise" or otherwise, but in the past it has been my experience that distros much prefer to have a single kernel build to address the needs of all their users. In the few cases I have seen where a second kernel build is supported it is usually for hardware enablement. I'm sure there are other cases too, I just haven't seen them personally; the big distros definitely seem to have a strong desire to limit the number of supported kernel configs/builds. > Tempting to add AUDIT_IOURING=default N, but won't work I guess One of the nice things about audit is that it can give you a history of what a user did on a system, which is very important for a number of use cases. If we selectively disable audit for certain subsystems we create a blind spot in the audit log, and in the case of io_uring this can be a very serious blind spot. I fear that if we can't come to some agreement here we will need to make io_uring and audit mutually exclusive at build time which would be awful; forcing many distros to either make a hard choice or carry out-of-tree patches.
On Wednesday, May 26, 2021 10:38:38 AM EDT Paul Moore wrote: > > > We would need to check with the current security requirements (there > > > are distro people on the linux-audit list that keep track of that > > > stuff), The requirements generally care about resource access. File open, connect, accept, etc. We don't care about read/write itself as that would flood the analysis. > > > but looking at the opcodes right now my gut feeling is that > > > most of the opcodes would be considered "security relevant" so > > > selective auditing might not be that useful in practice. I'd say maybe a quarter to a third look interesting. > > > It would > > > definitely clutter the code and increase the chances that new opcodes > > > would not be properly audited when they are merged. There is that... > > I'm curious, why it's enabled by many distros by default? Are there > > use cases they use? > > We've already talked about certain users and environments where audit > is an important requirement, e.g. public sector, health care, > financial institutions, etc.; without audit Linux wouldn't be an > option for these users, People that care about auditing are under regulatory mandates. They care more about the audit event than the performance. Imagine you have a system with some brand new medical discovery. You want to know anyone who accesses the information in case it gets leaked out. You don't care how slow the system gets - you simply *have* to know everyone who's looked at the documents. -Steve
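As a concrete example of the mandate Steve describes, a single watch rule records every process that touches a sensitive file, and the log can later be searched by key (the path and key below are made up for illustration):

% auditctl -w /srv/research/discovery.dat -p rwa -k discovery-access
% ausearch -k discovery-access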
Am 26.05.21 um 16:38 schrieb Paul Moore: > On Wed, May 26, 2021 at 6:19 AM Pavel Begunkov <asml.silence@gmail.com> wrote: >> On 5/26/21 3:04 AM, Paul Moore wrote: >>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: >>>> On 5/24/21 1:59 PM, Paul Moore wrote: >>>>> That said, audit is not for everyone, and we have build time and >>>>> runtime options to help make life easier. Beyond simply disabling >>>>> audit at compile time a number of Linux distributions effectively >>>>> shortcut audit at runtime by adding a "never" rule to the audit >>>>> filter, for example: >>>>> >>>>> % auditctl -a task,never >>>> >>>> As has been brought up, the issue we're facing is that distros have >>>> CONFIG_AUDIT=y and hence the above is the best real world case outside >>>> of people doing custom kernels. My question would then be how much >>>> overhead the above will add, considering it's an entry/exit call per op. >>>> If auditctl is turned off, what is the expectation in turns of overhead? >>> >>> I commented on that case in my last email to Pavel, but I'll try to go >>> over it again in a little more detail. >>> >>> As we discussed earlier in this thread, we can skip the req->opcode >>> check before both the _entry and _exit calls, so we are left with just >>> the bare audit calls in the io_uring code. As the _entry and _exit >>> functions are small, I've copied them and their supporting functions >>> below and I'll try to explain what would happen in CONFIG_AUDIT=y, >>> "task,never" case. >>> >>> + static inline struct audit_context *audit_context(void) >>> + { >>> + return current->audit_context; >>> + } >>> >>> + static inline bool audit_dummy_context(void) >>> + { >>> + void *p = audit_context(); >>> + return !p || *(int *)p; >>> + } >>> >>> + static inline void audit_uring_entry(u8 op) >>> + { >>> + if (unlikely(audit_enabled && audit_context())) >>> + __audit_uring_entry(op); >>> + } >> >> I'd rather agree that it's my cycle-picking. The case I care about >> is CONFIG_AUDIT=y (because everybody enable it), and io_uring >> tracing _not_ enabled at runtime. If enabled let them suffer >> the overhead, it will probably dip down the performance >> >> So, for the case I care about it's two of >> >> if (unlikely(audit_enabled && current->audit_context)) >> >> in the hot path. load-test-jump + current, so it will >> be around 7x2 instructions. We can throw away audit_enabled >> as you say systemd already enables it, that will give >> 4x2 instructions including 2 conditional jumps. > > We've basically got it down to the equivalent of two > "current->audit_context != NULL" checks in the case where audit is > built into the kernel but disabled at runtime, e.g. CONFIG_AUDIT=y and > "task,never". I'm at a loss for how we can lower the overhead any > further, but I'm open to suggestions. > >> That's not great at all. And that's why I brought up >> the question about need of pre and post hooks and whether >> can be combined. Would be just 4 instructions and that is >> ok (ish). > > As discussed previously in this thread that isn't really an option > from an audit perspective. > >>> We would need to check with the current security requirements (there >>> are distro people on the linux-audit list that keep track of that >>> stuff), but looking at the opcodes right now my gut feeling is that >>> most of the opcodes would be considered "security relevant" so >>> selective auditing might not be that useful in practice. 
It would >>> definitely clutter the code and increase the chances that new opcodes >>> would not be properly audited when they are merged. >> >> I'm curious, why it's enabled by many distros by default? Are there >> use cases they use? > > We've already talked about certain users and environments where audit > is an important requirement, e.g. public sector, health care, > financial institutions, etc.; without audit Linux wouldn't be an > option for these users, at least not without heavy modification, > out-of-tree/ISV patches, etc. I currently don't have any direct ties > to any distros, "Enterprise" or otherwise, but in the past it has been > my experience that distros much prefer to have a single kernel build > to address the needs of all their users. In the few cases I have seen > where a second kernel build is supported it is usually for hardware > enablement. I'm sure there are other cases too, I just haven't seen > them personally; the big distros definitely seem to have a strong > desire to limit the number of supported kernel configs/builds. > >> Tempting to add AUDIT_IOURING=default N, but won't work I guess > > One of the nice things about audit is that it can give you a history > of what a user did on a system, which is very important for a number > of use cases. If we selectively disable audit for certain subsystems > we create a blind spot in the audit log, and in the case of io_uring > this can be a very serious blind spot. I fear that if we can't come > to some agreement here we will need to make io_uring and audit > mutually exclusive at build time which would be awful; forcing many > distros to either make a hard choice or carry out-of-tree patches. I'm wondering why it's not enough to have the native auditing just to happen. E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes already go via security_socket_{recvmsg,sendmsg,connect}() IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls and should also trigger audit_inode() and security_file_open(). So why is there anything special needed for io_uring (now that the native worker threads are used)? Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall would do? If so, I think that should just be fixed... Additional LSM based restrictions could be hooked into the io_check_restriction() path and setup at io_uring_setup() or early io_uring_register() time. What do you think? metze
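A rough shape of Stefan's last suggestion, for illustration only: security_uring_restrict() is an invented hook name (no such LSM hook exists in this series), while ctx->restrictions.sqe_op is the existing opcode bitmap that io_check_restriction() consults on the submission path:

/* Hypothetical: let an LSM vet a ring's allowed opcodes once, when
 * restrictions are registered, instead of adding checks per request. */
static int io_apply_lsm_restrictions(struct io_ring_ctx *ctx)
{
	return security_uring_restrict(current_cred(),
				       ctx->restrictions.sqe_op,
				       IORING_OP_LAST);
}

As Richard notes below, though, this addresses access control more than it addresses audit's need to build a record for each operation.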
On 2021-05-26 17:17, Stefan Metzmacher wrote: > > Am 26.05.21 um 16:38 schrieb Paul Moore: > > On Wed, May 26, 2021 at 6:19 AM Pavel Begunkov <asml.silence@gmail.com> wrote: > >> On 5/26/21 3:04 AM, Paul Moore wrote: > >>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: > >>>> On 5/24/21 1:59 PM, Paul Moore wrote: > >>>>> That said, audit is not for everyone, and we have build time and > >>>>> runtime options to help make life easier. Beyond simply disabling > >>>>> audit at compile time a number of Linux distributions effectively > >>>>> shortcut audit at runtime by adding a "never" rule to the audit > >>>>> filter, for example: > >>>>> > >>>>> % auditctl -a task,never > >>>> > >>>> As has been brought up, the issue we're facing is that distros have > >>>> CONFIG_AUDIT=y and hence the above is the best real world case outside > >>>> of people doing custom kernels. My question would then be how much > >>>> overhead the above will add, considering it's an entry/exit call per op. > >>>> If auditctl is turned off, what is the expectation in turns of overhead? > >>> > >>> I commented on that case in my last email to Pavel, but I'll try to go > >>> over it again in a little more detail. > >>> > >>> As we discussed earlier in this thread, we can skip the req->opcode > >>> check before both the _entry and _exit calls, so we are left with just > >>> the bare audit calls in the io_uring code. As the _entry and _exit > >>> functions are small, I've copied them and their supporting functions > >>> below and I'll try to explain what would happen in CONFIG_AUDIT=y, > >>> "task,never" case. > >>> > >>> + static inline struct audit_context *audit_context(void) > >>> + { > >>> + return current->audit_context; > >>> + } > >>> > >>> + static inline bool audit_dummy_context(void) > >>> + { > >>> + void *p = audit_context(); > >>> + return !p || *(int *)p; > >>> + } > >>> > >>> + static inline void audit_uring_entry(u8 op) > >>> + { > >>> + if (unlikely(audit_enabled && audit_context())) > >>> + __audit_uring_entry(op); > >>> + } > >> > >> I'd rather agree that it's my cycle-picking. The case I care about > >> is CONFIG_AUDIT=y (because everybody enable it), and io_uring > >> tracing _not_ enabled at runtime. If enabled let them suffer > >> the overhead, it will probably dip down the performance > >> > >> So, for the case I care about it's two of > >> > >> if (unlikely(audit_enabled && current->audit_context)) > >> > >> in the hot path. load-test-jump + current, so it will > >> be around 7x2 instructions. We can throw away audit_enabled > >> as you say systemd already enables it, that will give > >> 4x2 instructions including 2 conditional jumps. > > > > We've basically got it down to the equivalent of two > > "current->audit_context != NULL" checks in the case where audit is > > built into the kernel but disabled at runtime, e.g. CONFIG_AUDIT=y and > > "task,never". I'm at a loss for how we can lower the overhead any > > further, but I'm open to suggestions. > > > >> That's not great at all. And that's why I brought up > >> the question about need of pre and post hooks and whether > >> can be combined. Would be just 4 instructions and that is > >> ok (ish). > > > > As discussed previously in this thread that isn't really an option > > from an audit perspective. 
> > > >>> We would need to check with the current security requirements (there > >>> are distro people on the linux-audit list that keep track of that > >>> stuff), but looking at the opcodes right now my gut feeling is that > >>> most of the opcodes would be considered "security relevant" so > >>> selective auditing might not be that useful in practice. It would > >>> definitely clutter the code and increase the chances that new opcodes > >>> would not be properly audited when they are merged. > >> > >> I'm curious, why it's enabled by many distros by default? Are there > >> use cases they use? > > > > We've already talked about certain users and environments where audit > > is an important requirement, e.g. public sector, health care, > > financial institutions, etc.; without audit Linux wouldn't be an > > option for these users, at least not without heavy modification, > > out-of-tree/ISV patches, etc. I currently don't have any direct ties > > to any distros, "Enterprise" or otherwise, but in the past it has been > > my experience that distros much prefer to have a single kernel build > > to address the needs of all their users. In the few cases I have seen > > where a second kernel build is supported it is usually for hardware > > enablement. I'm sure there are other cases too, I just haven't seen > > them personally; the big distros definitely seem to have a strong > > desire to limit the number of supported kernel configs/builds. > > > >> Tempting to add AUDIT_IOURING=default N, but won't work I guess > > > > One of the nice things about audit is that it can give you a history > > of what a user did on a system, which is very important for a number > > of use cases. If we selectively disable audit for certain subsystems > > we create a blind spot in the audit log, and in the case of io_uring > > this can be a very serious blind spot. I fear that if we can't come > > to some agreement here we will need to make io_uring and audit > > mutually exclusive at build time which would be awful; forcing many > > distros to either make a hard choice or carry out-of-tree patches. > > I'm wondering why it's not enough to have the native auditing just to happen. The audit context needs to be set up for each event. This happens in audit_syslog_entry and audit_syslog_exit. > E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes > already go via security_socket_{recvmsg,sendmsg,connect}() > > IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls > and should also trigger audit_inode() and security_file_open(). These are extra hooks to grab operation-specific (syscall) parameters. > So why is there anything special needed for io_uring (now that the native worker threads are used)? Because syscall has been bypassed by a memory-mapped work queue. > Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall > would do? If so, I think that should just be fixed... This is by design to speed it up. This is what Paul's iouring entry and exit hooks do. > Additional LSM based restrictions could be hooked into the io_check_restriction() path > and setup at io_uring_setup() or early io_uring_register() time. > > What do you think? > > metze - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
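To make that last point concrete: the in-VFS collectors the native paths hit are themselves gated on the audit_context that the syscall (or proposed io_uring) entry hook sets up. Paraphrasing the inline wrapper in include/linux/audit.h:

static inline void audit_inode(struct filename *name,
			       const struct dentry *dentry,
			       unsigned int aflags)
{
	/* With no audit_context on current (no entry hook ran), this is
	 * a no-op and no PATH record is collected for the event. */
	if (unlikely(!audit_dummy_context()))
		__audit_inode(name, dentry, aflags);
}

So reaching do_filp_open() from an io_uring worker is not, by itself, enough to produce a complete audit event.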
> I'm wondering why it's not enough to have the native auditing just to happen. > > E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes > already go via security_socket_{recvmsg,sendmsg,connect}() > > IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls > and should also trigger audit_inode() and security_file_open(). > > So why is there anything special needed for io_uring (now that the native worker threads are used)? > > Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall > would do? If so, I think that should just be fixed... stefan's points crossed my mind as well. but assuming iouring buy-in is required, from a design perspective, rather than inserting these audit conditionals in the hotpath, wouldn't a layering model work better? aka enabling auditing changes the function entry point into io_uring and passes operations through an auditing layer, then back to the main entry point. then there is no cost to audit disabled code, and you just force audit to pay whatever double processing cost that entails. V
On 5/26/2021 8:49 AM, Victor Stewart wrote: >> I'm wondering why it's not enough to have the native auditing just to happen. >> >> E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes >> already go via security_socket_{recvmsg,sendmsg,connect}() >> >> IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls >> and should also trigger audit_inode() and security_file_open(). >> >> So why is there anything special needed for io_uring (now that the native worker threads are used)? >> >> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall >> would do? If so, I think that should just be fixed... > stefan's points crossed my mind as well. > > but assuming iouring buy-in is required, from a design perspective, > rather than inserting these audit conditionals in the hotpath, > wouldn't a layering model work better? > aka enabling auditing changes the function entry point into io_uring > and passes operations through an auditing layer, then back to the main > entry point. then there is no > cost to audit disabled code, and you just force audit to pay whatever > double processing cost that entails. There seems to be an assumption that the audit code isn't concerned about processing cost. This is complete nonsense. While an audit system has to be accurate and complete (hence encompassing io_uring) it also has to be sufficiently performant for the system to be useful when it is enabled. It would have been real handy had the audit and LSM requirements been designed into io_uring. The performance implications could have been addressed up front. Instead it all has to be retrofit. io_uring, Audit and LSM need to perform well and the best way to ensure that the combination works is cooperation between the developers. Any response that includes the word "just" is unlikely to be helpful. > > V > > -- > Linux-audit mailing list > Linux-audit@redhat.com > https://listman.redhat.com/mailman/listinfo/linux-audit >
On 5/25/21 8:04 PM, Paul Moore wrote: > On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: >> On 5/24/21 1:59 PM, Paul Moore wrote: >>> That said, audit is not for everyone, and we have build time and >>> runtime options to help make life easier. Beyond simply disabling >>> audit at compile time a number of Linux distributions effectively >>> shortcut audit at runtime by adding a "never" rule to the audit >>> filter, for example: >>> >>> % auditctl -a task,never >> >> As has been brought up, the issue we're facing is that distros have >> CONFIG_AUDIT=y and hence the above is the best real world case outside >> of people doing custom kernels. My question would then be how much >> overhead the above will add, considering it's an entry/exit call per op. >> If auditctl is turned off, what is the expectation in turns of overhead? > > I commented on that case in my last email to Pavel, but I'll try to go > over it again in a little more detail. > > As we discussed earlier in this thread, we can skip the req->opcode > check before both the _entry and _exit calls, so we are left with just > the bare audit calls in the io_uring code. As the _entry and _exit > functions are small, I've copied them and their supporting functions > below and I'll try to explain what would happen in CONFIG_AUDIT=y, > "task,never" case. > > + static inline struct audit_context *audit_context(void) > + { > + return current->audit_context; > + } > > + static inline bool audit_dummy_context(void) > + { > + void *p = audit_context(); > + return !p || *(int *)p; > + } > > + static inline void audit_uring_entry(u8 op) > + { > + if (unlikely(audit_enabled && audit_context())) > + __audit_uring_entry(op); > + } > > We have one if statement where the conditional checks on two > individual conditions. The first (audit_enabled) is simply a check to > see if anyone has "turned on" auditing at runtime; historically this > worked rather well, and still does in a number of places, but ever > since systemd has taken to forcing audit on regardless of the admin's > audit configuration it is less useful. The second (audit_context()) > is a check to see if an audit_context has been allocated for the > current task. In the case of "task,never" current->audit_context will > be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath > will never be called. > > Worst case here is checking the value of audit_enabled and > current->audit_context. Depending on which you think is more likely > we can change the order of the check so that the > current->audit_context check is first if you feel that is more likely > to be NULL than audit_enabled is to be false (it may be that way now). > > + static inline void audit_uring_exit(int success, long code) > + { > + if (unlikely(!audit_dummy_context())) > + __audit_uring_exit(success, code); > + } > > The exit call is very similar to the entry call, but in the > "task,never" case it is very simple as the first check to be performed > is the current->audit_context check which we know to be NULL. The > __audit_uring_exit() slowpath will never be called. I actually ran some numbers this morning. The test base is 5.13+, and CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline test and the test with this series applied. I used your git branch as of this morning. The test case is my usual peak perf test, which is random reads at QD=128 and using polled IO. It's a single core test, not threaded. 
I ran two different tests - one was having a thread just do the IO, the other is using SQPOLL to do the IO for us. The device is capable of more IOPS than a single core can deliver, so we're CPU limited in this test. Hence it's a good test case as it does actual work, and shows software overhead quite nicely. Runs are very stable (less than 0.5% difference between runs on the same base), yet I did average 4 runs.

Kernel          SQPOLL    IOPS        Perf diff
---------------------------------------------------------
5.13            0         3029872      0.0%
5.13            1         3031056      0.0%
5.13 + audit    0         2894160     -4.5%
5.13 + audit    1         2886168     -4.8%

That's an immediate drop in perf of almost 5%. Looking at a quick profile of it (nothing fancy, just checking for 'audit' in the profile) shows this:

+ 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry
+ 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit
  0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
  0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit

Note that this is with _no_ rules!

>> aio never had any audit logging as far as I can tell. I think it'd make >> a lot more sense to selectively enable audit logging only for opcodes >> that we care about. File open/create/unlink/mkdir etc, that kind of >> thing. File level operations that people would care about logging. Would >> they care about logging a buffer registration or a polled read from a >> device/file? I highly doubt it, and we don't do that for alternative >> methods either. Doesn't really make sense for a lot of the other >> operations, imho. > > We would need to check with the current security requirements (there > are distro people on the linux-audit list that keep track of that > stuff), but looking at the opcodes right now my gut feeling is that > most of the opcodes would be considered "security relevant" so > selective auditing might not be that useful in practice. It would > definitely clutter the code and increase the chances that new opcodes > would not be properly audited when they are merged.

We don't audit read/write from aio, as mentioned. In the past two decades, I take it that hasn't been a concern? I agree that some opcodes should _definitely_ be audited. Things like opening a file, closing a file, removing/creating a file, mount, etc. But normal read/write, I think that's just utter noise and not useful at all. Auditing on a per-opcode basis is trivial, io_uring already has provisions for flagging opcode requirements and this would just be another one.
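The "provisions for flagging opcode requirements" Jens refers to is io_uring's per-opcode definition table, io_op_defs[] in fs/io_uring.c. A selective-audit bit could plausibly be one more field there; this is a sketch of the idea only -- the audit_skip flag and the opcodes chosen are assumptions, not code from this series:

struct io_op_def {
	/* ... existing per-opcode requirement bits ... */
	unsigned		needs_file : 1;
	/* hypothetical: suppress audit entry/exit for "uninteresting" ops */
	unsigned		audit_skip : 1;
};

static const struct io_op_def io_op_defs[] = {
	[IORING_OP_READV] = {
		.needs_file	= 1,
		.audit_skip	= 1,	/* plain I/O: no audit, per Jens */
	},
	[IORING_OP_OPENAT] = {
		.audit_skip	= 0,	/* opens/creates: always audit */
	},
	/* ... */
};

/* in io_issue_sqe(), before the dispatch switch: */
	if (!io_op_defs[req->opcode].audit_skip)
		audit_uring_entry(req->opcode);

Whether that is acceptable is exactly the disagreement here: Jens argues read/write is noise, Paul argues selective auditing risks blind spots and unaudited new opcodes.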
On 5/26/21 9:49 AM, Richard Guy Briggs wrote: >> So why is there anything special needed for io_uring (now that the >> native worker threads are used)? > > Because syscall has been bypassed by a memory-mapped work queue. I don't follow this one at all, that's just the delivery mechanism if you choose to use SQPOLL. If you do, then a thread sibling of the original task does the actual system call. There's no magic involved there, and the tasks are related. So care to expand on that? >> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall >> would do? If so, I think that should just be fixed... > > This is by design to speed it up. This is what Paul's iouring entry and > exit hooks do. As far as I can tell, we're doing double logging at that point, if the syscall helper does the audit as well. We'll get something logging the io_uring opcode (eg IORING_OP_OPENAT2), then log again when we call the fs helper. That's _assuming_ that we always hit the logging part when we call into the vfs, but that's something that can be updated to be true and kept an eye on for future additions. Why is the double logging useful? It only tells you that the invocation was via io_uring as the delivery mechanism rather than the usual system call, but the effect is the same - the file is opened, for example. I feel like I'm missing something here, or the other side is. Or both!
On 5/26/21 11:15 AM, Jens Axboe wrote: > On 5/25/21 8:04 PM, Paul Moore wrote: >> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: >>> On 5/24/21 1:59 PM, Paul Moore wrote: >>>> That said, audit is not for everyone, and we have build time and >>>> runtime options to help make life easier. Beyond simply disabling >>>> audit at compile time a number of Linux distributions effectively >>>> shortcut audit at runtime by adding a "never" rule to the audit >>>> filter, for example: >>>> >>>> % auditctl -a task,never >>> >>> As has been brought up, the issue we're facing is that distros have >>> CONFIG_AUDIT=y and hence the above is the best real world case outside >>> of people doing custom kernels. My question would then be how much >>> overhead the above will add, considering it's an entry/exit call per op. >>> If auditctl is turned off, what is the expectation in turns of overhead? >> >> I commented on that case in my last email to Pavel, but I'll try to go >> over it again in a little more detail. >> >> As we discussed earlier in this thread, we can skip the req->opcode >> check before both the _entry and _exit calls, so we are left with just >> the bare audit calls in the io_uring code. As the _entry and _exit >> functions are small, I've copied them and their supporting functions >> below and I'll try to explain what would happen in CONFIG_AUDIT=y, >> "task,never" case. >> >> + static inline struct audit_context *audit_context(void) >> + { >> + return current->audit_context; >> + } >> >> + static inline bool audit_dummy_context(void) >> + { >> + void *p = audit_context(); >> + return !p || *(int *)p; >> + } >> >> + static inline void audit_uring_entry(u8 op) >> + { >> + if (unlikely(audit_enabled && audit_context())) >> + __audit_uring_entry(op); >> + } >> >> We have one if statement where the conditional checks on two >> individual conditions. The first (audit_enabled) is simply a check to >> see if anyone has "turned on" auditing at runtime; historically this >> worked rather well, and still does in a number of places, but ever >> since systemd has taken to forcing audit on regardless of the admin's >> audit configuration it is less useful. The second (audit_context()) >> is a check to see if an audit_context has been allocated for the >> current task. In the case of "task,never" current->audit_context will >> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath >> will never be called. >> >> Worst case here is checking the value of audit_enabled and >> current->audit_context. Depending on which you think is more likely >> we can change the order of the check so that the >> current->audit_context check is first if you feel that is more likely >> to be NULL than audit_enabled is to be false (it may be that way now). >> >> + static inline void audit_uring_exit(int success, long code) >> + { >> + if (unlikely(!audit_dummy_context())) >> + __audit_uring_exit(success, code); >> + } >> >> The exit call is very similar to the entry call, but in the >> "task,never" case it is very simple as the first check to be performed >> is the current->audit_context check which we know to be NULL. The >> __audit_uring_exit() slowpath will never be called. > > I actually ran some numbers this morning. The test base is 5.13+, and > CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline > test and the test with this series applied. I used your git branch as of > this morning. 
>
> The test case is my usual peak perf test, which is random reads at
> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> two different tests - one was having a thread just do the IO, the other
> is using SQPOLL to do the IO for us. The device is capable than more
> IOPS than a single core can deliver, so we're CPU limited in this test.
> Hence it's a good test case as it does actual work, and shows software
> overhead quite nicely. Runs are very stable (less than 0.5% difference
> between runs on the same base), yet I did average 4 runs.
>
> Kernel		SQPOLL	IOPS		Perf diff
> ---------------------------------------------------------
> 5.13		0	3029872		0.0%
> 5.13		1	3031056		0.0%
> 5.13 + audit	0	2894160		-4.5%
> 5.13 + audit	1	2886168		-4.8%
>
> That's an immediate drop in perf of almost 5%. Looking at a quick
> profile of it (nothing fancy, just checking for 'audit' in the profile)
> shows this:
>
> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>
> Note that this is with _no_ rules!

io_uring also supports a NOP command, which basically just measures reqs/sec through the interface. Ran that as well:

Kernel		SQPOLL	IOPS	Perf diff
---------------------------------------------------------
5.13		0	31.05M	0.0%
5.13 + audit	0	25.31M	-18.5%

and profile for the latter includes:

+    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
+    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
     0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
     0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
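For context on what a NOP run exercises, a minimal liburing-based loop is sketched below. This is not the tool used for the numbers above (it submits and reaps one NOP at a time, whereas a real peak test batches submissions); it only illustrates the per-request path that the proposed audit hooks sit on.

#include <stdio.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	unsigned long completed = 0;

	/* 128-entry SQ/CQ, default (non-SQPOLL) setup */
	if (io_uring_queue_init(128, &ring, 0) < 0)
		return 1;

	for (int i = 0; i < 1000000; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		if (!sqe)
			break;
		io_uring_prep_nop(sqe);	/* IORING_OP_NOP: no real work */
		io_uring_submit(&ring);
		if (io_uring_wait_cqe(&ring, &cqe))
			break;
		io_uring_cqe_seen(&ring, cqe);
		completed++;
	}
	printf("completed %lu NOPs\n", completed);
	io_uring_queue_exit(&ring);
	return 0;
}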
On 5/26/21 11:31 AM, Jens Axboe wrote: > On 5/26/21 11:15 AM, Jens Axboe wrote: >> On 5/25/21 8:04 PM, Paul Moore wrote: >>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: >>>> On 5/24/21 1:59 PM, Paul Moore wrote: >>>>> That said, audit is not for everyone, and we have build time and >>>>> runtime options to help make life easier. Beyond simply disabling >>>>> audit at compile time a number of Linux distributions effectively >>>>> shortcut audit at runtime by adding a "never" rule to the audit >>>>> filter, for example: >>>>> >>>>> % auditctl -a task,never >>>> >>>> As has been brought up, the issue we're facing is that distros have >>>> CONFIG_AUDIT=y and hence the above is the best real world case outside >>>> of people doing custom kernels. My question would then be how much >>>> overhead the above will add, considering it's an entry/exit call per op. >>>> If auditctl is turned off, what is the expectation in turns of overhead? >>> >>> I commented on that case in my last email to Pavel, but I'll try to go >>> over it again in a little more detail. >>> >>> As we discussed earlier in this thread, we can skip the req->opcode >>> check before both the _entry and _exit calls, so we are left with just >>> the bare audit calls in the io_uring code. As the _entry and _exit >>> functions are small, I've copied them and their supporting functions >>> below and I'll try to explain what would happen in CONFIG_AUDIT=y, >>> "task,never" case. >>> >>> + static inline struct audit_context *audit_context(void) >>> + { >>> + return current->audit_context; >>> + } >>> >>> + static inline bool audit_dummy_context(void) >>> + { >>> + void *p = audit_context(); >>> + return !p || *(int *)p; >>> + } >>> >>> + static inline void audit_uring_entry(u8 op) >>> + { >>> + if (unlikely(audit_enabled && audit_context())) >>> + __audit_uring_entry(op); >>> + } >>> >>> We have one if statement where the conditional checks on two >>> individual conditions. The first (audit_enabled) is simply a check to >>> see if anyone has "turned on" auditing at runtime; historically this >>> worked rather well, and still does in a number of places, but ever >>> since systemd has taken to forcing audit on regardless of the admin's >>> audit configuration it is less useful. The second (audit_context()) >>> is a check to see if an audit_context has been allocated for the >>> current task. In the case of "task,never" current->audit_context will >>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath >>> will never be called. >>> >>> Worst case here is checking the value of audit_enabled and >>> current->audit_context. Depending on which you think is more likely >>> we can change the order of the check so that the >>> current->audit_context check is first if you feel that is more likely >>> to be NULL than audit_enabled is to be false (it may be that way now). >>> >>> + static inline void audit_uring_exit(int success, long code) >>> + { >>> + if (unlikely(!audit_dummy_context())) >>> + __audit_uring_exit(success, code); >>> + } >>> >>> The exit call is very similar to the entry call, but in the >>> "task,never" case it is very simple as the first check to be performed >>> is the current->audit_context check which we know to be NULL. The >>> __audit_uring_exit() slowpath will never be called. >> >> I actually ran some numbers this morning. The test base is 5.13+, and >> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline >> test and the test with this series applied. 
I used your git branch as of
>> this morning.
>>
>> The test case is my usual peak perf test, which is random reads at
>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
>> two different tests - one was having a thread just do the IO, the other
>> is using SQPOLL to do the IO for us. The device is capable than more
>> IOPS than a single core can deliver, so we're CPU limited in this test.
>> Hence it's a good test case as it does actual work, and shows software
>> overhead quite nicely. Runs are very stable (less than 0.5% difference
>> between runs on the same base), yet I did average 4 runs.
>>
>> Kernel		SQPOLL	IOPS		Perf diff
>> ---------------------------------------------------------
>> 5.13		0	3029872		0.0%
>> 5.13		1	3031056		0.0%
>> 5.13 + audit	0	2894160		-4.5%
>> 5.13 + audit	1	2886168		-4.8%
>>
>> That's an immediate drop in perf of almost 5%. Looking at a quick
>> profile of it (nothing fancy, just checking for 'audit' in the profile)
>> shows this:
>>
>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>>
>> Note that this is with _no_ rules!
>
> io_uring also supports a NOP command, which basically just measures
> reqs/sec through the interface. Ran that as well:
>
> Kernel		SQPOLL	IOPS	Perf diff
> ---------------------------------------------------------
> 5.13		0	31.05M	0.0%
> 5.13 + audit	0	25.31M	-18.5%
>
> and profile for the latter includes:
>
> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit

As Pavel correctly pointed out, it looks like auditing is enabled. And indeed it was! Hence the above numbers are without having turned off auditing. Running the NOPs after having turned off audit, we get 30.6M IOPS, which is down about 1.5% from the baseline. The results for the polled random read test above did _not_ change from this; they are still down the same amount.

Note, and I should have included this in the first email, this is not any kind of argument for or against audit logging. It's purely meant to be a set of numbers that show how the current series impacts performance.
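As a practical note for anyone re-running these numbers, the audit state the test depends on can be inspected and reset from userspace before each run, e.g.:

% auditctl -s            # show kernel audit status (enabled flag, backlog, audit PID)
% auditctl -l            # list currently loaded rules
% auditctl -D            # delete all rules
% auditctl -a task,never # re-add the per-task short-circuit rule discussed above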
On 5/26/21 11:54 AM, Jens Axboe wrote: > On 5/26/21 11:31 AM, Jens Axboe wrote: >> On 5/26/21 11:15 AM, Jens Axboe wrote: >>> On 5/25/21 8:04 PM, Paul Moore wrote: >>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: >>>>> On 5/24/21 1:59 PM, Paul Moore wrote: >>>>>> That said, audit is not for everyone, and we have build time and >>>>>> runtime options to help make life easier. Beyond simply disabling >>>>>> audit at compile time a number of Linux distributions effectively >>>>>> shortcut audit at runtime by adding a "never" rule to the audit >>>>>> filter, for example: >>>>>> >>>>>> % auditctl -a task,never >>>>> >>>>> As has been brought up, the issue we're facing is that distros have >>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside >>>>> of people doing custom kernels. My question would then be how much >>>>> overhead the above will add, considering it's an entry/exit call per op. >>>>> If auditctl is turned off, what is the expectation in turns of overhead? >>>> >>>> I commented on that case in my last email to Pavel, but I'll try to go >>>> over it again in a little more detail. >>>> >>>> As we discussed earlier in this thread, we can skip the req->opcode >>>> check before both the _entry and _exit calls, so we are left with just >>>> the bare audit calls in the io_uring code. As the _entry and _exit >>>> functions are small, I've copied them and their supporting functions >>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y, >>>> "task,never" case. >>>> >>>> + static inline struct audit_context *audit_context(void) >>>> + { >>>> + return current->audit_context; >>>> + } >>>> >>>> + static inline bool audit_dummy_context(void) >>>> + { >>>> + void *p = audit_context(); >>>> + return !p || *(int *)p; >>>> + } >>>> >>>> + static inline void audit_uring_entry(u8 op) >>>> + { >>>> + if (unlikely(audit_enabled && audit_context())) >>>> + __audit_uring_entry(op); >>>> + } >>>> >>>> We have one if statement where the conditional checks on two >>>> individual conditions. The first (audit_enabled) is simply a check to >>>> see if anyone has "turned on" auditing at runtime; historically this >>>> worked rather well, and still does in a number of places, but ever >>>> since systemd has taken to forcing audit on regardless of the admin's >>>> audit configuration it is less useful. The second (audit_context()) >>>> is a check to see if an audit_context has been allocated for the >>>> current task. In the case of "task,never" current->audit_context will >>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath >>>> will never be called. >>>> >>>> Worst case here is checking the value of audit_enabled and >>>> current->audit_context. Depending on which you think is more likely >>>> we can change the order of the check so that the >>>> current->audit_context check is first if you feel that is more likely >>>> to be NULL than audit_enabled is to be false (it may be that way now). >>>> >>>> + static inline void audit_uring_exit(int success, long code) >>>> + { >>>> + if (unlikely(!audit_dummy_context())) >>>> + __audit_uring_exit(success, code); >>>> + } >>>> >>>> The exit call is very similar to the entry call, but in the >>>> "task,never" case it is very simple as the first check to be performed >>>> is the current->audit_context check which we know to be NULL. The >>>> __audit_uring_exit() slowpath will never be called. >>> >>> I actually ran some numbers this morning. 
The test base is 5.13+, and
>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
>>> test and the test with this series applied. I used your git branch as of
>>> this morning.
>>>
>>> The test case is my usual peak perf test, which is random reads at
>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
>>> two different tests - one was having a thread just do the IO, the other
>>> is using SQPOLL to do the IO for us. The device is capable than more
>>> IOPS than a single core can deliver, so we're CPU limited in this test.
>>> Hence it's a good test case as it does actual work, and shows software
>>> overhead quite nicely. Runs are very stable (less than 0.5% difference
>>> between runs on the same base), yet I did average 4 runs.
>>>
>>> Kernel		SQPOLL	IOPS		Perf diff
>>> ---------------------------------------------------------
>>> 5.13		0	3029872		0.0%
>>> 5.13		1	3031056		0.0%
>>> 5.13 + audit	0	2894160		-4.5%
>>> 5.13 + audit	1	2886168		-4.8%
>>>
>>> That's an immediate drop in perf of almost 5%. Looking at a quick
>>> profile of it (nothing fancy, just checking for 'audit' in the profile)
>>> shows this:
>>>
>>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>>>
>>> Note that this is with _no_ rules!
>>
>> io_uring also supports a NOP command, which basically just measures
>> reqs/sec through the interface. Ran that as well:
>>
>> Kernel		SQPOLL	IOPS	Perf diff
>> ---------------------------------------------------------
>> 5.13		0	31.05M	0.0%
>> 5.13 + audit	0	25.31M	-18.5%
>>
>> and profile for the latter includes:
>>
>> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>
> As Pavel correctly pointed it, looks like auditing is enabled. And
> indeed it was! Hence the above numbers is without having turned off
> auditing. Running the NOPs after having turned off audit, we get 30.6M
> IOPS, which is down about 1.5% from the baseline. The results for the
> polled random read test above did _not_ change from this, they are still
> down the same amount.
>
> Note, and I should have included this in the first email, this is not
> any kind of argument for or against audit logging. It's purely meant to
> be a set of numbers that show how the current series impacts
> performance.

And finally, just checking whether we see any real impact if we make the auditing optional per opcode - and the answer is no. Using the below patch, which effectively bypasses the audit calls unless the opcode has flagged the need to do so, I cannot measure any difference in perf (as expected).

To turn this into something useful, my suggestion as a viable path forward would be:

1) Use something like the below patch and flag request types that we want to do audit logging for.

2) As Pavel suggested, eliminate the need for having both an entry/exit hook, turning it into just one. That effectively cuts the number of checks and calls in half.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index aa065808ddcf..2c7c913b786b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -885,6 +885,8 @@ struct io_op_def {
 	unsigned		needs_async_setup : 1;
 	/* should block plug */
 	unsigned		plug : 1;
+	/* should audit */
+	unsigned		audit : 1;
 	/* size of async data needed, if any */
 	unsigned short		async_size;
 };
@@ -6122,7 +6124,7 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	if (req->work.creds && req->work.creds != current_cred())
 		creds = override_creds(req->work.creds);
 
-	if (req->opcode < IORING_OP_LAST)
+	if (io_op_defs[req->opcode].audit)
 		audit_uring_entry(req->opcode);
 
 	switch (req->opcode) {
@@ -6231,7 +6233,7 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 		break;
 	}
 
-	if (req->opcode < IORING_OP_LAST)
+	if (io_op_defs[req->opcode].audit)
 		audit_uring_exit(!ret, ret);
 
 	if (creds)
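If this direction were taken, the new flag would presumably only be set on the "security relevant" request types. A hypothetical marking of a few io_op_defs[] entries (the opcode list and surrounding fields are assumptions, not part of the patch above; existing per-opcode fields are omitted) might look like:

	[IORING_OP_OPENAT] = {
		.audit			= 1,
	},
	[IORING_OP_OPENAT2] = {
		.audit			= 1,
	},
	[IORING_OP_RENAMEAT] = {
		.audit			= 1,
	},
	[IORING_OP_UNLINKAT] = {
		.audit			= 1,
	},

Whether the set should be restricted like this at all is exactly the open question in the rest of the thread.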
On Wed, May 26, 2021 at 1:54 PM Jens Axboe <axboe@kernel.dk> wrote: > On 5/26/21 11:31 AM, Jens Axboe wrote: > > On 5/26/21 11:15 AM, Jens Axboe wrote: > >> On 5/25/21 8:04 PM, Paul Moore wrote: > >>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: > >>>> On 5/24/21 1:59 PM, Paul Moore wrote: > >>>>> That said, audit is not for everyone, and we have build time and > >>>>> runtime options to help make life easier. Beyond simply disabling > >>>>> audit at compile time a number of Linux distributions effectively > >>>>> shortcut audit at runtime by adding a "never" rule to the audit > >>>>> filter, for example: > >>>>> > >>>>> % auditctl -a task,never > >>>> > >>>> As has been brought up, the issue we're facing is that distros have > >>>> CONFIG_AUDIT=y and hence the above is the best real world case outside > >>>> of people doing custom kernels. My question would then be how much > >>>> overhead the above will add, considering it's an entry/exit call per op. > >>>> If auditctl is turned off, what is the expectation in turns of overhead? > >>> > >>> I commented on that case in my last email to Pavel, but I'll try to go > >>> over it again in a little more detail. > >>> > >>> As we discussed earlier in this thread, we can skip the req->opcode > >>> check before both the _entry and _exit calls, so we are left with just > >>> the bare audit calls in the io_uring code. As the _entry and _exit > >>> functions are small, I've copied them and their supporting functions > >>> below and I'll try to explain what would happen in CONFIG_AUDIT=y, > >>> "task,never" case. > >>> > >>> + static inline struct audit_context *audit_context(void) > >>> + { > >>> + return current->audit_context; > >>> + } > >>> > >>> + static inline bool audit_dummy_context(void) > >>> + { > >>> + void *p = audit_context(); > >>> + return !p || *(int *)p; > >>> + } > >>> > >>> + static inline void audit_uring_entry(u8 op) > >>> + { > >>> + if (unlikely(audit_enabled && audit_context())) > >>> + __audit_uring_entry(op); > >>> + } > >>> > >>> We have one if statement where the conditional checks on two > >>> individual conditions. The first (audit_enabled) is simply a check to > >>> see if anyone has "turned on" auditing at runtime; historically this > >>> worked rather well, and still does in a number of places, but ever > >>> since systemd has taken to forcing audit on regardless of the admin's > >>> audit configuration it is less useful. The second (audit_context()) > >>> is a check to see if an audit_context has been allocated for the > >>> current task. In the case of "task,never" current->audit_context will > >>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath > >>> will never be called. > >>> > >>> Worst case here is checking the value of audit_enabled and > >>> current->audit_context. Depending on which you think is more likely > >>> we can change the order of the check so that the > >>> current->audit_context check is first if you feel that is more likely > >>> to be NULL than audit_enabled is to be false (it may be that way now). > >>> > >>> + static inline void audit_uring_exit(int success, long code) > >>> + { > >>> + if (unlikely(!audit_dummy_context())) > >>> + __audit_uring_exit(success, code); > >>> + } > >>> > >>> The exit call is very similar to the entry call, but in the > >>> "task,never" case it is very simple as the first check to be performed > >>> is the current->audit_context check which we know to be NULL. The > >>> __audit_uring_exit() slowpath will never be called. 
> >> > >> I actually ran some numbers this morning. The test base is 5.13+, and > >> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline > >> test and the test with this series applied. I used your git branch as of > >> this morning. > >> > >> The test case is my usual peak perf test, which is random reads at > >> QD=128 and using polled IO. It's a single core test, not threaded. I ran > >> two different tests - one was having a thread just do the IO, the other > >> is using SQPOLL to do the IO for us. The device is capable than more > >> IOPS than a single core can deliver, so we're CPU limited in this test. > >> Hence it's a good test case as it does actual work, and shows software > >> overhead quite nicely. Runs are very stable (less than 0.5% difference > >> between runs on the same base), yet I did average 4 runs. > >> > >> Kernel SQPOLL IOPS Perf diff > >> --------------------------------------------------------- > >> 5.13 0 3029872 0.0% > >> 5.13 1 3031056 0.0% > >> 5.13 + audit 0 2894160 -4.5% > >> 5.13 + audit 1 2886168 -4.8% > >> > >> That's an immediate drop in perf of almost 5%. Looking at a quick > >> profile of it (nothing fancy, just checking for 'audit' in the profile) > >> shows this: > >> > >> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry > >> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit > >> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry > >> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit > >> > >> Note that this is with _no_ rules! > > > > io_uring also supports a NOP command, which basically just measures > > reqs/sec through the interface. Ran that as well: > > > > Kernel SQPOLL IOPS Perf diff > > --------------------------------------------------------- > > 5.13 0 31.05M 0.0% > > 5.13 + audit 0 25.31M -18.5% > > > > and profile for the latter includes: > > > > + 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry > > + 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit > > 0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry > > 0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit > > As Pavel correctly pointed it, looks like auditing is enabled. And > indeed it was! Hence the above numbers is without having turned off > auditing. Running the NOPs after having turned off audit, we get 30.6M > IOPS, which is down about 1.5% from the baseline. The results for the > polled random read test above did _not_ change from this, they are still > down the same amount. > > Note, and I should have included this in the first email, this is not > any kind of argument for or against audit logging. It's purely meant to > be a set of numbers that show how the current series impacts > performance. Thanks for running some tests Jens, unfortunately the git tree hadn't been updated to reflect the improved audit_uring_entry() and audit_uring_exit() functions as we were still discussing things and I thought there still might be some changes. I just now updated the branch with the latest code so if you have the cycles (ha!) to run another perf test I think those numbers would be more interesting. For example, I don't believe you should see __audit_uring_entry() or __audit_uring_exit() at all if you have the audit "task,never" rule loaded; you will see audit_uring_entry() and audit_uring_exit() but as we discussed previously those should just be a single "current->audit_context != NULL" check. Also, I want to pull back a bit as I think there is confusion about how audit works and why these changes are necessary. 
As everyone likely knows, there are audit calls sprinkled throughout the kernel code, e.g. audit_log_format() and similar. In addition to these calls that most are aware of, there are setup/teardown audit calls that run at task creation and destruction as well as at syscall entry and exit. It is these lesser known calls that are responsible for the filtering, setup, and general management of the audit context state in the system; they also handle generation of some audit records such as SYSCALL, PATH, etc. based on information that is recorded by audit calls inserted into other places in the kernel. The audit_alloc_kernel() and audit_free() calls we are adding to the io_uring/io-wq code are intended to do similar things to the existing audit task creation/destruction hooks, but for the io_uring/io-wq kernel threads. Similarly the audit_uring_entry() and audit_uring_exit() calls are intended to act as replacements for the syscall entry and exit code. If we want the existing audit calls in the kernel to work properly, we need these setup/teardown functions. Hopefully that makes things a bit more clear regarding these calls in the io_uring/io-wq code. Another point I wanted to address is the "double audit" (!!!) that has been mentioned recently in this thread. Don't get too excited, this isn't quite what you think it is, it is a side effect of how io_uring can dispatch certain operations. As I think the io_uring folks already know, io_uring can process I/O ops in several different contexts; one of which is after a syscall completes but before the kernel returns to userspace. In this particular case things can get rather interesting from an audit perspective as we need to handle both the syscall audit records *and* the io_uring operation records. If you look closer at the audit code in this patchset you'll see some of the fun stuff we need to do to make sure this case is handled correctly. If you happen to see a code path that takes you through audit_syscall_entry() + <syscall_stuff> + audit_uring_entry() + <io_uring_stuff> + audit_uring_exit() + audit_syscall_exit() rest assured that is correct :) Of course there is the normal audit_uring_entry() and audit_uring_exit() without the audit syscall hooks in other code paths, e.g. SQPOLL, but those are less dramatic than the "OMG, double audit!!!" ;)
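To make the intended call pattern concrete, here is a rough sketch of where the new hooks sit, pieced together from the description above (placement, error handling, and the audit_alloc_kernel() signature are assumptions based on this thread, not verified against the posted tree):

	/* io-wq worker / SQPOLL thread creation: give the kernel thread an
	 * audit_context so the filtering/setup machinery has state to work on */
	ret = audit_alloc_kernel(tsk);
	if (ret)
		goto err;

	/* per operation in io_issue_sqe(), standing in for the syscall
	 * entry/exit hooks that a normal task would have gone through */
	audit_uring_entry(req->opcode);
	/* ... the switch (req->opcode) dispatch runs here ... */
	audit_uring_exit(!ret, ret);

	/* kernel thread teardown: drop the audit_context again */
	audit_free(tsk);

In the syscall-dispatch case the same pair simply nests inside the existing audit_syscall_entry()/audit_syscall_exit() bracket, which is the "double audit" ordering described above.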
On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe@kernel.dk> wrote: > On 5/26/21 11:54 AM, Jens Axboe wrote: > > On 5/26/21 11:31 AM, Jens Axboe wrote: > >> On 5/26/21 11:15 AM, Jens Axboe wrote: > >>> On 5/25/21 8:04 PM, Paul Moore wrote: > >>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: > >>>>> On 5/24/21 1:59 PM, Paul Moore wrote: > >>>>>> That said, audit is not for everyone, and we have build time and > >>>>>> runtime options to help make life easier. Beyond simply disabling > >>>>>> audit at compile time a number of Linux distributions effectively > >>>>>> shortcut audit at runtime by adding a "never" rule to the audit > >>>>>> filter, for example: > >>>>>> > >>>>>> % auditctl -a task,never > >>>>> > >>>>> As has been brought up, the issue we're facing is that distros have > >>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside > >>>>> of people doing custom kernels. My question would then be how much > >>>>> overhead the above will add, considering it's an entry/exit call per op. > >>>>> If auditctl is turned off, what is the expectation in turns of overhead? > >>>> > >>>> I commented on that case in my last email to Pavel, but I'll try to go > >>>> over it again in a little more detail. > >>>> > >>>> As we discussed earlier in this thread, we can skip the req->opcode > >>>> check before both the _entry and _exit calls, so we are left with just > >>>> the bare audit calls in the io_uring code. As the _entry and _exit > >>>> functions are small, I've copied them and their supporting functions > >>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y, > >>>> "task,never" case. > >>>> > >>>> + static inline struct audit_context *audit_context(void) > >>>> + { > >>>> + return current->audit_context; > >>>> + } > >>>> > >>>> + static inline bool audit_dummy_context(void) > >>>> + { > >>>> + void *p = audit_context(); > >>>> + return !p || *(int *)p; > >>>> + } > >>>> > >>>> + static inline void audit_uring_entry(u8 op) > >>>> + { > >>>> + if (unlikely(audit_enabled && audit_context())) > >>>> + __audit_uring_entry(op); > >>>> + } > >>>> > >>>> We have one if statement where the conditional checks on two > >>>> individual conditions. The first (audit_enabled) is simply a check to > >>>> see if anyone has "turned on" auditing at runtime; historically this > >>>> worked rather well, and still does in a number of places, but ever > >>>> since systemd has taken to forcing audit on regardless of the admin's > >>>> audit configuration it is less useful. The second (audit_context()) > >>>> is a check to see if an audit_context has been allocated for the > >>>> current task. In the case of "task,never" current->audit_context will > >>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath > >>>> will never be called. > >>>> > >>>> Worst case here is checking the value of audit_enabled and > >>>> current->audit_context. Depending on which you think is more likely > >>>> we can change the order of the check so that the > >>>> current->audit_context check is first if you feel that is more likely > >>>> to be NULL than audit_enabled is to be false (it may be that way now). 
> >>>> > >>>> + static inline void audit_uring_exit(int success, long code) > >>>> + { > >>>> + if (unlikely(!audit_dummy_context())) > >>>> + __audit_uring_exit(success, code); > >>>> + } > >>>> > >>>> The exit call is very similar to the entry call, but in the > >>>> "task,never" case it is very simple as the first check to be performed > >>>> is the current->audit_context check which we know to be NULL. The > >>>> __audit_uring_exit() slowpath will never be called. > >>> > >>> I actually ran some numbers this morning. The test base is 5.13+, and > >>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline > >>> test and the test with this series applied. I used your git branch as of > >>> this morning. > >>> > >>> The test case is my usual peak perf test, which is random reads at > >>> QD=128 and using polled IO. It's a single core test, not threaded. I ran > >>> two different tests - one was having a thread just do the IO, the other > >>> is using SQPOLL to do the IO for us. The device is capable than more > >>> IOPS than a single core can deliver, so we're CPU limited in this test. > >>> Hence it's a good test case as it does actual work, and shows software > >>> overhead quite nicely. Runs are very stable (less than 0.5% difference > >>> between runs on the same base), yet I did average 4 runs. > >>> > >>> Kernel SQPOLL IOPS Perf diff > >>> --------------------------------------------------------- > >>> 5.13 0 3029872 0.0% > >>> 5.13 1 3031056 0.0% > >>> 5.13 + audit 0 2894160 -4.5% > >>> 5.13 + audit 1 2886168 -4.8% > >>> > >>> That's an immediate drop in perf of almost 5%. Looking at a quick > >>> profile of it (nothing fancy, just checking for 'audit' in the profile) > >>> shows this: > >>> > >>> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry > >>> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit > >>> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry > >>> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit > >>> > >>> Note that this is with _no_ rules! > >> > >> io_uring also supports a NOP command, which basically just measures > >> reqs/sec through the interface. Ran that as well: > >> > >> Kernel SQPOLL IOPS Perf diff > >> --------------------------------------------------------- > >> 5.13 0 31.05M 0.0% > >> 5.13 + audit 0 25.31M -18.5% > >> > >> and profile for the latter includes: > >> > >> + 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry > >> + 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit > >> 0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry > >> 0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit > > > > As Pavel correctly pointed it, looks like auditing is enabled. And > > indeed it was! Hence the above numbers is without having turned off > > auditing. Running the NOPs after having turned off audit, we get 30.6M > > IOPS, which is down about 1.5% from the baseline. The results for the > > polled random read test above did _not_ change from this, they are still > > down the same amount. > > > > Note, and I should have included this in the first email, this is not > > any kind of argument for or against audit logging. It's purely meant to > > be a set of numbers that show how the current series impacts > > performance. > > And finally, just checking if we make it optional per opcode if we see > any real impact, and the answer is no. 
Using the below patch which
> effectively bypasses audit calls unless the opcode has flagged the need
> to do so, I cannot measure any difference in perf (as expected).
>
> To turn this into something useful, my suggestion as a viable path
> forward would be:
>
> 1) Use something like the below patch and flag request types that we
> want to do audit logging for.
>
> 2) As Pavel suggested, eliminate the need for having both and entry/exit
> hook, turning it into just one. That effectively cuts the number of
> checks and calls in half.

I suspect the updated working-io_uring branch with HEAD at 1f25193a3f54 (updated a short time ago, see my last email in this thread) will improve performance. Also, as has been mentioned several times now, for audit to work we need both the _entry and _exit calls.
On 5/26/21 7:44 PM, Paul Moore wrote: > On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe@kernel.dk> wrote: >> On 5/26/21 11:54 AM, Jens Axboe wrote: >>> On 5/26/21 11:31 AM, Jens Axboe wrote: >>>> On 5/26/21 11:15 AM, Jens Axboe wrote: >>>>> On 5/25/21 8:04 PM, Paul Moore wrote: >>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: >>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote: >>>>>>>> That said, audit is not for everyone, and we have build time and >>>>>>>> runtime options to help make life easier. Beyond simply disabling >>>>>>>> audit at compile time a number of Linux distributions effectively >>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit >>>>>>>> filter, for example: >>>>>>>> >>>>>>>> % auditctl -a task,never >>>>>>> >>>>>>> As has been brought up, the issue we're facing is that distros have >>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside >>>>>>> of people doing custom kernels. My question would then be how much >>>>>>> overhead the above will add, considering it's an entry/exit call per op. >>>>>>> If auditctl is turned off, what is the expectation in turns of overhead? >>>>>> >>>>>> I commented on that case in my last email to Pavel, but I'll try to go >>>>>> over it again in a little more detail. >>>>>> >>>>>> As we discussed earlier in this thread, we can skip the req->opcode >>>>>> check before both the _entry and _exit calls, so we are left with just >>>>>> the bare audit calls in the io_uring code. As the _entry and _exit >>>>>> functions are small, I've copied them and their supporting functions >>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y, >>>>>> "task,never" case. >>>>>> >>>>>> + static inline struct audit_context *audit_context(void) >>>>>> + { >>>>>> + return current->audit_context; >>>>>> + } >>>>>> >>>>>> + static inline bool audit_dummy_context(void) >>>>>> + { >>>>>> + void *p = audit_context(); >>>>>> + return !p || *(int *)p; >>>>>> + } >>>>>> >>>>>> + static inline void audit_uring_entry(u8 op) >>>>>> + { >>>>>> + if (unlikely(audit_enabled && audit_context())) >>>>>> + __audit_uring_entry(op); >>>>>> + } >>>>>> >>>>>> We have one if statement where the conditional checks on two >>>>>> individual conditions. The first (audit_enabled) is simply a check to >>>>>> see if anyone has "turned on" auditing at runtime; historically this >>>>>> worked rather well, and still does in a number of places, but ever >>>>>> since systemd has taken to forcing audit on regardless of the admin's >>>>>> audit configuration it is less useful. The second (audit_context()) >>>>>> is a check to see if an audit_context has been allocated for the >>>>>> current task. In the case of "task,never" current->audit_context will >>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath >>>>>> will never be called. >>>>>> >>>>>> Worst case here is checking the value of audit_enabled and >>>>>> current->audit_context. Depending on which you think is more likely >>>>>> we can change the order of the check so that the >>>>>> current->audit_context check is first if you feel that is more likely >>>>>> to be NULL than audit_enabled is to be false (it may be that way now). 
>>>>>> >>>>>> + static inline void audit_uring_exit(int success, long code) >>>>>> + { >>>>>> + if (unlikely(!audit_dummy_context())) >>>>>> + __audit_uring_exit(success, code); >>>>>> + } >>>>>> >>>>>> The exit call is very similar to the entry call, but in the >>>>>> "task,never" case it is very simple as the first check to be performed >>>>>> is the current->audit_context check which we know to be NULL. The >>>>>> __audit_uring_exit() slowpath will never be called. >>>>> >>>>> I actually ran some numbers this morning. The test base is 5.13+, and >>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline >>>>> test and the test with this series applied. I used your git branch as of >>>>> this morning. >>>>> >>>>> The test case is my usual peak perf test, which is random reads at >>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran >>>>> two different tests - one was having a thread just do the IO, the other >>>>> is using SQPOLL to do the IO for us. The device is capable than more >>>>> IOPS than a single core can deliver, so we're CPU limited in this test. >>>>> Hence it's a good test case as it does actual work, and shows software >>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference >>>>> between runs on the same base), yet I did average 4 runs. >>>>> >>>>> Kernel SQPOLL IOPS Perf diff >>>>> --------------------------------------------------------- >>>>> 5.13 0 3029872 0.0% >>>>> 5.13 1 3031056 0.0% >>>>> 5.13 + audit 0 2894160 -4.5% >>>>> 5.13 + audit 1 2886168 -4.8% >>>>> >>>>> That's an immediate drop in perf of almost 5%. Looking at a quick >>>>> profile of it (nothing fancy, just checking for 'audit' in the profile) >>>>> shows this: >>>>> >>>>> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry >>>>> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit >>>>> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry >>>>> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit >>>>> >>>>> Note that this is with _no_ rules! >>>> >>>> io_uring also supports a NOP command, which basically just measures >>>> reqs/sec through the interface. Ran that as well: >>>> >>>> Kernel SQPOLL IOPS Perf diff >>>> --------------------------------------------------------- >>>> 5.13 0 31.05M 0.0% >>>> 5.13 + audit 0 25.31M -18.5% >>>> >>>> and profile for the latter includes: >>>> >>>> + 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry >>>> + 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit >>>> 0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry >>>> 0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit >>> >>> As Pavel correctly pointed it, looks like auditing is enabled. And >>> indeed it was! Hence the above numbers is without having turned off >>> auditing. Running the NOPs after having turned off audit, we get 30.6M >>> IOPS, which is down about 1.5% from the baseline. The results for the >>> polled random read test above did _not_ change from this, they are still >>> down the same amount. >>> >>> Note, and I should have included this in the first email, this is not >>> any kind of argument for or against audit logging. It's purely meant to >>> be a set of numbers that show how the current series impacts >>> performance. >> >> And finally, just checking if we make it optional per opcode if we see >> any real impact, and the answer is no. 
Using the below patch which
>> effectively bypasses audit calls unless the opcode has flagged the need
>> to do so, I cannot measure any difference in perf (as expected).
>>
>> To turn this into something useful, my suggestion as a viable path
>> forward would be:
>>
>> 1) Use something like the below patch and flag request types that we
>> want to do audit logging for.
>>
>> 2) As Pavel suggested, eliminate the need for having both and entry/exit
>> hook, turning it into just one. That effectively cuts the number of
>> checks and calls in half.
>
> I suspect the updated working-io_uring branch with HEAD at
> 1f25193a3f54 (updated a short time ago, see my last email in this
> thread) will improve performance. Also, as has been mention several

See the email you replied to: the ~1.5% was basically the overhead of two `if (io_op_defs[req->opcode].audit)` checks in the nop case, where at least one of the req->opcode loads is already cached. But to be completely fair, that check also misses an unlikely() annotation.
On Wed, May 26, 2021 at 2:57 PM Pavel Begunkov <asml.silence@gmail.com> wrote: > On 5/26/21 7:44 PM, Paul Moore wrote: > > On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe@kernel.dk> wrote: > >> On 5/26/21 11:54 AM, Jens Axboe wrote: > >>> On 5/26/21 11:31 AM, Jens Axboe wrote: > >>>> On 5/26/21 11:15 AM, Jens Axboe wrote: > >>>>> On 5/25/21 8:04 PM, Paul Moore wrote: > >>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: > >>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote: > >>>>>>>> That said, audit is not for everyone, and we have build time and > >>>>>>>> runtime options to help make life easier. Beyond simply disabling > >>>>>>>> audit at compile time a number of Linux distributions effectively > >>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit > >>>>>>>> filter, for example: > >>>>>>>> > >>>>>>>> % auditctl -a task,never > >>>>>>> > >>>>>>> As has been brought up, the issue we're facing is that distros have > >>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside > >>>>>>> of people doing custom kernels. My question would then be how much > >>>>>>> overhead the above will add, considering it's an entry/exit call per op. > >>>>>>> If auditctl is turned off, what is the expectation in turns of overhead? > >>>>>> > >>>>>> I commented on that case in my last email to Pavel, but I'll try to go > >>>>>> over it again in a little more detail. > >>>>>> > >>>>>> As we discussed earlier in this thread, we can skip the req->opcode > >>>>>> check before both the _entry and _exit calls, so we are left with just > >>>>>> the bare audit calls in the io_uring code. As the _entry and _exit > >>>>>> functions are small, I've copied them and their supporting functions > >>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y, > >>>>>> "task,never" case. > >>>>>> > >>>>>> + static inline struct audit_context *audit_context(void) > >>>>>> + { > >>>>>> + return current->audit_context; > >>>>>> + } > >>>>>> > >>>>>> + static inline bool audit_dummy_context(void) > >>>>>> + { > >>>>>> + void *p = audit_context(); > >>>>>> + return !p || *(int *)p; > >>>>>> + } > >>>>>> > >>>>>> + static inline void audit_uring_entry(u8 op) > >>>>>> + { > >>>>>> + if (unlikely(audit_enabled && audit_context())) > >>>>>> + __audit_uring_entry(op); > >>>>>> + } > >>>>>> > >>>>>> We have one if statement where the conditional checks on two > >>>>>> individual conditions. The first (audit_enabled) is simply a check to > >>>>>> see if anyone has "turned on" auditing at runtime; historically this > >>>>>> worked rather well, and still does in a number of places, but ever > >>>>>> since systemd has taken to forcing audit on regardless of the admin's > >>>>>> audit configuration it is less useful. The second (audit_context()) > >>>>>> is a check to see if an audit_context has been allocated for the > >>>>>> current task. In the case of "task,never" current->audit_context will > >>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath > >>>>>> will never be called. > >>>>>> > >>>>>> Worst case here is checking the value of audit_enabled and > >>>>>> current->audit_context. Depending on which you think is more likely > >>>>>> we can change the order of the check so that the > >>>>>> current->audit_context check is first if you feel that is more likely > >>>>>> to be NULL than audit_enabled is to be false (it may be that way now). 
> >>>>>> > >>>>>> + static inline void audit_uring_exit(int success, long code) > >>>>>> + { > >>>>>> + if (unlikely(!audit_dummy_context())) > >>>>>> + __audit_uring_exit(success, code); > >>>>>> + } > >>>>>> > >>>>>> The exit call is very similar to the entry call, but in the > >>>>>> "task,never" case it is very simple as the first check to be performed > >>>>>> is the current->audit_context check which we know to be NULL. The > >>>>>> __audit_uring_exit() slowpath will never be called. > >>>>> > >>>>> I actually ran some numbers this morning. The test base is 5.13+, and > >>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline > >>>>> test and the test with this series applied. I used your git branch as of > >>>>> this morning. > >>>>> > >>>>> The test case is my usual peak perf test, which is random reads at > >>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran > >>>>> two different tests - one was having a thread just do the IO, the other > >>>>> is using SQPOLL to do the IO for us. The device is capable than more > >>>>> IOPS than a single core can deliver, so we're CPU limited in this test. > >>>>> Hence it's a good test case as it does actual work, and shows software > >>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference > >>>>> between runs on the same base), yet I did average 4 runs. > >>>>> > >>>>> Kernel SQPOLL IOPS Perf diff > >>>>> --------------------------------------------------------- > >>>>> 5.13 0 3029872 0.0% > >>>>> 5.13 1 3031056 0.0% > >>>>> 5.13 + audit 0 2894160 -4.5% > >>>>> 5.13 + audit 1 2886168 -4.8% > >>>>> > >>>>> That's an immediate drop in perf of almost 5%. Looking at a quick > >>>>> profile of it (nothing fancy, just checking for 'audit' in the profile) > >>>>> shows this: > >>>>> > >>>>> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry > >>>>> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit > >>>>> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry > >>>>> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit > >>>>> > >>>>> Note that this is with _no_ rules! > >>>> > >>>> io_uring also supports a NOP command, which basically just measures > >>>> reqs/sec through the interface. Ran that as well: > >>>> > >>>> Kernel SQPOLL IOPS Perf diff > >>>> --------------------------------------------------------- > >>>> 5.13 0 31.05M 0.0% > >>>> 5.13 + audit 0 25.31M -18.5% > >>>> > >>>> and profile for the latter includes: > >>>> > >>>> + 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry > >>>> + 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit > >>>> 0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry > >>>> 0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit > >>> > >>> As Pavel correctly pointed it, looks like auditing is enabled. And > >>> indeed it was! Hence the above numbers is without having turned off > >>> auditing. Running the NOPs after having turned off audit, we get 30.6M > >>> IOPS, which is down about 1.5% from the baseline. The results for the > >>> polled random read test above did _not_ change from this, they are still > >>> down the same amount. > >>> > >>> Note, and I should have included this in the first email, this is not > >>> any kind of argument for or against audit logging. It's purely meant to > >>> be a set of numbers that show how the current series impacts > >>> performance. > >> > >> And finally, just checking if we make it optional per opcode if we see > >> any real impact, and the answer is no. 
Using the below patch which > >> effectively bypasses audit calls unless the opcode has flagged the need > >> to do so, I cannot measure any difference in perf (as expected). > >> > >> To turn this into something useful, my suggestion as a viable path > >> forward would be: > >> > >> 1) Use something like the below patch and flag request types that we > >> want to do audit logging for. > >> > >> 2) As Pavel suggested, eliminate the need for having both and entry/exit > >> hook, turning it into just one. That effectively cuts the number of > >> checks and calls in half. > > > > I suspect the updated working-io_uring branch with HEAD at > > 1f25193a3f54 (updated a short time ago, see my last email in this > > thread) will improve performance. Also, as has been mention several > > See the email you replied to, ~1.5% was basically an overhead of > two `if (io_op_defs[req->opcode].audit)` in case of nops, where at > least once req->opcode is cached. But to be completely fair, misses > unlikely Maybe. I remain skeptical that "io_op_defs[req->opcode].audit" has the same cost as "unlikely(audit_context())".
On 5/26/21 12:44 PM, Paul Moore wrote: > On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe@kernel.dk> wrote: >> On 5/26/21 11:54 AM, Jens Axboe wrote: >>> On 5/26/21 11:31 AM, Jens Axboe wrote: >>>> On 5/26/21 11:15 AM, Jens Axboe wrote: >>>>> On 5/25/21 8:04 PM, Paul Moore wrote: >>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: >>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote: >>>>>>>> That said, audit is not for everyone, and we have build time and >>>>>>>> runtime options to help make life easier. Beyond simply disabling >>>>>>>> audit at compile time a number of Linux distributions effectively >>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit >>>>>>>> filter, for example: >>>>>>>> >>>>>>>> % auditctl -a task,never >>>>>>> >>>>>>> As has been brought up, the issue we're facing is that distros have >>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside >>>>>>> of people doing custom kernels. My question would then be how much >>>>>>> overhead the above will add, considering it's an entry/exit call per op. >>>>>>> If auditctl is turned off, what is the expectation in turns of overhead? >>>>>> >>>>>> I commented on that case in my last email to Pavel, but I'll try to go >>>>>> over it again in a little more detail. >>>>>> >>>>>> As we discussed earlier in this thread, we can skip the req->opcode >>>>>> check before both the _entry and _exit calls, so we are left with just >>>>>> the bare audit calls in the io_uring code. As the _entry and _exit >>>>>> functions are small, I've copied them and their supporting functions >>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y, >>>>>> "task,never" case. >>>>>> >>>>>> + static inline struct audit_context *audit_context(void) >>>>>> + { >>>>>> + return current->audit_context; >>>>>> + } >>>>>> >>>>>> + static inline bool audit_dummy_context(void) >>>>>> + { >>>>>> + void *p = audit_context(); >>>>>> + return !p || *(int *)p; >>>>>> + } >>>>>> >>>>>> + static inline void audit_uring_entry(u8 op) >>>>>> + { >>>>>> + if (unlikely(audit_enabled && audit_context())) >>>>>> + __audit_uring_entry(op); >>>>>> + } >>>>>> >>>>>> We have one if statement where the conditional checks on two >>>>>> individual conditions. The first (audit_enabled) is simply a check to >>>>>> see if anyone has "turned on" auditing at runtime; historically this >>>>>> worked rather well, and still does in a number of places, but ever >>>>>> since systemd has taken to forcing audit on regardless of the admin's >>>>>> audit configuration it is less useful. The second (audit_context()) >>>>>> is a check to see if an audit_context has been allocated for the >>>>>> current task. In the case of "task,never" current->audit_context will >>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath >>>>>> will never be called. >>>>>> >>>>>> Worst case here is checking the value of audit_enabled and >>>>>> current->audit_context. Depending on which you think is more likely >>>>>> we can change the order of the check so that the >>>>>> current->audit_context check is first if you feel that is more likely >>>>>> to be NULL than audit_enabled is to be false (it may be that way now). 
>>>>>> >>>>>> + static inline void audit_uring_exit(int success, long code) >>>>>> + { >>>>>> + if (unlikely(!audit_dummy_context())) >>>>>> + __audit_uring_exit(success, code); >>>>>> + } >>>>>> >>>>>> The exit call is very similar to the entry call, but in the >>>>>> "task,never" case it is very simple as the first check to be performed >>>>>> is the current->audit_context check which we know to be NULL. The >>>>>> __audit_uring_exit() slowpath will never be called. >>>>> >>>>> I actually ran some numbers this morning. The test base is 5.13+, and >>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline >>>>> test and the test with this series applied. I used your git branch as of >>>>> this morning. >>>>> >>>>> The test case is my usual peak perf test, which is random reads at >>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran >>>>> two different tests - one was having a thread just do the IO, the other >>>>> is using SQPOLL to do the IO for us. The device is capable than more >>>>> IOPS than a single core can deliver, so we're CPU limited in this test. >>>>> Hence it's a good test case as it does actual work, and shows software >>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference >>>>> between runs on the same base), yet I did average 4 runs. >>>>> >>>>> Kernel SQPOLL IOPS Perf diff >>>>> --------------------------------------------------------- >>>>> 5.13 0 3029872 0.0% >>>>> 5.13 1 3031056 0.0% >>>>> 5.13 + audit 0 2894160 -4.5% >>>>> 5.13 + audit 1 2886168 -4.8% >>>>> >>>>> That's an immediate drop in perf of almost 5%. Looking at a quick >>>>> profile of it (nothing fancy, just checking for 'audit' in the profile) >>>>> shows this: >>>>> >>>>> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry >>>>> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit >>>>> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry >>>>> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit >>>>> >>>>> Note that this is with _no_ rules! >>>> >>>> io_uring also supports a NOP command, which basically just measures >>>> reqs/sec through the interface. Ran that as well: >>>> >>>> Kernel SQPOLL IOPS Perf diff >>>> --------------------------------------------------------- >>>> 5.13 0 31.05M 0.0% >>>> 5.13 + audit 0 25.31M -18.5% >>>> >>>> and profile for the latter includes: >>>> >>>> + 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry >>>> + 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit >>>> 0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry >>>> 0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit >>> >>> As Pavel correctly pointed it, looks like auditing is enabled. And >>> indeed it was! Hence the above numbers is without having turned off >>> auditing. Running the NOPs after having turned off audit, we get 30.6M >>> IOPS, which is down about 1.5% from the baseline. The results for the >>> polled random read test above did _not_ change from this, they are still >>> down the same amount. >>> >>> Note, and I should have included this in the first email, this is not >>> any kind of argument for or against audit logging. It's purely meant to >>> be a set of numbers that show how the current series impacts >>> performance. >> >> And finally, just checking if we make it optional per opcode if we see >> any real impact, and the answer is no. 
Using the below patch which >> effectively bypasses audit calls unless the opcode has flagged the need >> to do so, I cannot measure any difference in perf (as expected). >> >> To turn this into something useful, my suggestion as a viable path >> forward would be: >> >> 1) Use something like the below patch and flag request types that we >> want to do audit logging for. >> >> 2) As Pavel suggested, eliminate the need for having both and entry/exit >> hook, turning it into just one. That effectively cuts the number of >> checks and calls in half. > > I suspect the updated working-io_uring branch with HEAD at > 1f25193a3f54 (updated a short time ago, see my last email in this > thread) will improve performance. Also, as has been mention several > times now, for audit to work we need both the _entry and _exit call. Pulled in your new stuff, and ran a quick test again (same as before, the rand reads). No change, we're still down ~5% with auditctl -a task,never having been run. If I stub out your two audit entry/exit calls by adding an if (0) before them, then perf is back to normal.
On Wed, May 26, 2021 at 3:44 PM Jens Axboe <axboe@kernel.dk> wrote: > On 5/26/21 12:44 PM, Paul Moore wrote: > > On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe@kernel.dk> wrote: > >> On 5/26/21 11:54 AM, Jens Axboe wrote: > >>> On 5/26/21 11:31 AM, Jens Axboe wrote: > >>>> On 5/26/21 11:15 AM, Jens Axboe wrote: > >>>>> On 5/25/21 8:04 PM, Paul Moore wrote: > >>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: > >>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote: > >>>>>>>> That said, audit is not for everyone, and we have build time and > >>>>>>>> runtime options to help make life easier. Beyond simply disabling > >>>>>>>> audit at compile time a number of Linux distributions effectively > >>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit > >>>>>>>> filter, for example: > >>>>>>>> > >>>>>>>> % auditctl -a task,never > >>>>>>> > >>>>>>> As has been brought up, the issue we're facing is that distros have > >>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside > >>>>>>> of people doing custom kernels. My question would then be how much > >>>>>>> overhead the above will add, considering it's an entry/exit call per op. > >>>>>>> If auditctl is turned off, what is the expectation in turns of overhead? > >>>>>> > >>>>>> I commented on that case in my last email to Pavel, but I'll try to go > >>>>>> over it again in a little more detail. > >>>>>> > >>>>>> As we discussed earlier in this thread, we can skip the req->opcode > >>>>>> check before both the _entry and _exit calls, so we are left with just > >>>>>> the bare audit calls in the io_uring code. As the _entry and _exit > >>>>>> functions are small, I've copied them and their supporting functions > >>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y, > >>>>>> "task,never" case. > >>>>>> > >>>>>> + static inline struct audit_context *audit_context(void) > >>>>>> + { > >>>>>> + return current->audit_context; > >>>>>> + } > >>>>>> > >>>>>> + static inline bool audit_dummy_context(void) > >>>>>> + { > >>>>>> + void *p = audit_context(); > >>>>>> + return !p || *(int *)p; > >>>>>> + } > >>>>>> > >>>>>> + static inline void audit_uring_entry(u8 op) > >>>>>> + { > >>>>>> + if (unlikely(audit_enabled && audit_context())) > >>>>>> + __audit_uring_entry(op); > >>>>>> + } > >>>>>> > >>>>>> We have one if statement where the conditional checks on two > >>>>>> individual conditions. The first (audit_enabled) is simply a check to > >>>>>> see if anyone has "turned on" auditing at runtime; historically this > >>>>>> worked rather well, and still does in a number of places, but ever > >>>>>> since systemd has taken to forcing audit on regardless of the admin's > >>>>>> audit configuration it is less useful. The second (audit_context()) > >>>>>> is a check to see if an audit_context has been allocated for the > >>>>>> current task. In the case of "task,never" current->audit_context will > >>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath > >>>>>> will never be called. > >>>>>> > >>>>>> Worst case here is checking the value of audit_enabled and > >>>>>> current->audit_context. Depending on which you think is more likely > >>>>>> we can change the order of the check so that the > >>>>>> current->audit_context check is first if you feel that is more likely > >>>>>> to be NULL than audit_enabled is to be false (it may be that way now). 
> >>>>>> > >>>>>> + static inline void audit_uring_exit(int success, long code) > >>>>>> + { > >>>>>> + if (unlikely(!audit_dummy_context())) > >>>>>> + __audit_uring_exit(success, code); > >>>>>> + } > >>>>>> > >>>>>> The exit call is very similar to the entry call, but in the > >>>>>> "task,never" case it is very simple as the first check to be performed > >>>>>> is the current->audit_context check which we know to be NULL. The > >>>>>> __audit_uring_exit() slowpath will never be called. > >>>>> > >>>>> I actually ran some numbers this morning. The test base is 5.13+, and > >>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline > >>>>> test and the test with this series applied. I used your git branch as of > >>>>> this morning. > >>>>> > >>>>> The test case is my usual peak perf test, which is random reads at > >>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran > >>>>> two different tests - one was having a thread just do the IO, the other > >>>>> is using SQPOLL to do the IO for us. The device is capable than more > >>>>> IOPS than a single core can deliver, so we're CPU limited in this test. > >>>>> Hence it's a good test case as it does actual work, and shows software > >>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference > >>>>> between runs on the same base), yet I did average 4 runs. > >>>>> > >>>>> Kernel SQPOLL IOPS Perf diff > >>>>> --------------------------------------------------------- > >>>>> 5.13 0 3029872 0.0% > >>>>> 5.13 1 3031056 0.0% > >>>>> 5.13 + audit 0 2894160 -4.5% > >>>>> 5.13 + audit 1 2886168 -4.8% > >>>>> > >>>>> That's an immediate drop in perf of almost 5%. Looking at a quick > >>>>> profile of it (nothing fancy, just checking for 'audit' in the profile) > >>>>> shows this: > >>>>> > >>>>> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry > >>>>> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit > >>>>> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry > >>>>> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit > >>>>> > >>>>> Note that this is with _no_ rules! > >>>> > >>>> io_uring also supports a NOP command, which basically just measures > >>>> reqs/sec through the interface. Ran that as well: > >>>> > >>>> Kernel SQPOLL IOPS Perf diff > >>>> --------------------------------------------------------- > >>>> 5.13 0 31.05M 0.0% > >>>> 5.13 + audit 0 25.31M -18.5% > >>>> > >>>> and profile for the latter includes: > >>>> > >>>> + 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry > >>>> + 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit > >>>> 0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry > >>>> 0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit > >>> > >>> As Pavel correctly pointed it, looks like auditing is enabled. And > >>> indeed it was! Hence the above numbers is without having turned off > >>> auditing. Running the NOPs after having turned off audit, we get 30.6M > >>> IOPS, which is down about 1.5% from the baseline. The results for the > >>> polled random read test above did _not_ change from this, they are still > >>> down the same amount. > >>> > >>> Note, and I should have included this in the first email, this is not > >>> any kind of argument for or against audit logging. It's purely meant to > >>> be a set of numbers that show how the current series impacts > >>> performance. > >> > >> And finally, just checking if we make it optional per opcode if we see > >> any real impact, and the answer is no. 
Using the below patch which > >> effectively bypasses audit calls unless the opcode has flagged the need > >> to do so, I cannot measure any difference in perf (as expected). > >> > >> To turn this into something useful, my suggestion as a viable path > >> forward would be: > >> > >> 1) Use something like the below patch and flag request types that we > >> want to do audit logging for. > >> > >> 2) As Pavel suggested, eliminate the need for having both and entry/exit > >> hook, turning it into just one. That effectively cuts the number of > >> checks and calls in half. > > > > I suspect the updated working-io_uring branch with HEAD at > > 1f25193a3f54 (updated a short time ago, see my last email in this > > thread) will improve performance. Also, as has been mention several > > times now, for audit to work we need both the _entry and _exit call. > > Pulled in your new stuff, and ran a quick test again (same as before, > the rand reads). No change, we're still down ~5% with auditctl -a > task,never having been run. If I stub out your two audit entry/exit > calls by adding an if (0) before them, then perf is back to normal. That's fun. Considering that we are down to basically a single if-conditional in the audit code and your other tests with another single conditional yielded no significant change it looks like we really only have one option left before we "agree to disagree" and have to mark io_uring and audit as mutually exclusive in Kconfig. If we moved the _entry and _exit calls into the individual operation case blocks (quick openat example below) so that only certain operations were able to be audited would that be acceptable assuming the high frequency ops were untouched? My initial gut feeling was that this would involve >50% of the ops, but Steve Grubb seems to think it would be less; it may be time to look at that a bit more seriously, but if it gets a NACK regardless it isn't worth the time - thoughts? case IORING_OP_OPENAT: audit_uring_entry(req->opcode); ret = io_openat(req, issue_flags); audit_uring_exit(!ret, ret); break;
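To make the shape of that proposal a little more concrete, here is a rough sketch of how the io_issue_sqe() switch could split, with the hot ops left untouched and only the security-relevant ones gaining the calls. Which opcodes land in which bucket is exactly the open question; the helper names (io_nop, io_read, io_openat, io_unlinkat) are the v5.13 ones and the selection shown is illustrative only:

    switch (req->opcode) {
    case IORING_OP_NOP:                     /* hot path, left untouched */
            ret = io_nop(req, issue_flags);
            break;
    case IORING_OP_READV:                   /* hot path, left untouched */
            ret = io_read(req, issue_flags);
            break;
    case IORING_OP_OPENAT:                  /* security-relevant, audited */
            audit_uring_entry(req->opcode);
            ret = io_openat(req, issue_flags);
            audit_uring_exit(!ret, ret);
            break;
    case IORING_OP_UNLINKAT:                /* security-relevant, audited */
            audit_uring_entry(req->opcode);
            ret = io_unlinkat(req, issue_flags);
            audit_uring_exit(!ret, ret);
            break;
    /* ... remaining opcodes ... */
    }

The cost for the untouched cases is zero added instructions, which is the property the io_uring side is asking for; the price is a little boilerplate in each audited case block.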
On 2021-05-26 11:22, Jens Axboe wrote: > On 5/26/21 9:49 AM, Richard Guy Briggs wrote: > >> So why is there anything special needed for io_uring (now that the > >> native worker threads are used)? > > > > Because syscall has been bypassed by a memory-mapped work queue. > > I don't follow this one at all, that's just the delivery mechanism if > you choose to use SQPOLL. If you do, then a thread sibling of the > original task does the actual system call. There's no magic involved > there, and the tasks are related. > > So care to expand on that? These may be poor examples, but hear me out... In the case of an open call, I'm guessing that they are mapped 1:1 syscall to io_uring action so that the action can be asynchronous. So in this case, there is a record of the action, but we don't see the result success/failure. I assume that file paths can only be specified in the original syscall and not in an SQPOLL action? In the case of a syscall read/write (which aren't as interesting generally to audit, but I'm concerned there are other similar situations that are), the syscall would be called for every buffer that is passed back and forth user/kernel and vice versa, providing a way to log all that activity. In the case of SQPOLL, I understand that a syscall sets up the action and then io_uring takes over and does the bulk transfer and we'd not have the visibility of how many times that action was repeated nor that the result success/failure was due to its asynchrony. Perhaps I am showing my ignorance, so please correct me if I have it wrong. > >> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall > >> would do? If so, I think that should just be fixed... > > > > This is by design to speed it up. This is what Paul's iouring entry and > > exit hooks do. > > As far as I can tell, we're doing double logging at that point, if the > syscall helper does the audit as well. We'll get something logging the > io_uring opcode (eg IORING_OP_OPENAT2), then log again when we call the > fs helper. That's _assuming_ that we always hit the logging part when we > call into the vfs, but that's something that can be updated to be true > and kept an eye on for future additions. > > Why is the double logging useful? It only tells you that the invocation > was via io_uring as the delivery mechanism rather than the usual system > call, but the effect is the same - the file is opened, for example. > > I feel like I'm missing something here, or the other side is. Or both! Paul addressed this in his reply, but let me add a more concrete example... There was one corner case I was looking at that showed up this issue. Please indicate if I have mischaracterized or misunderstood. A syscall would generate records something like this: AUDIT_SYSCALL AUDIT_... AUDIT_EOE A io_uring SQPOLL event would generate something like this: AUDIT_URINGOP AUDIT_... AUDIT_EOE The "hybrid" event that is a syscall that starts an io_uring action would generate something like this: AUDIT_URINGOP [possible AUDIT_CONFIG_CHANGE (from killed_trees)] AUDIT_SYSCALL AUDIT_... AUDIT_EOE The AUDIT_... is all the operation-specific records that log parameters that aren't able to be expressed in the SYSLOG or URINGOP records such as pathnames, other arguments, and context (pwd, etc...). So this isn't "double logging". It is either introducing an io_uring event, or it is providing more detail about the io_uring arguments to a syscall event. > Jens Axboe - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. 
S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote: > ... If we moved the _entry > and _exit calls into the individual operation case blocks (quick > openat example below) so that only certain operations were able to be > audited would that be acceptable assuming the high frequency ops were > untouched? My initial gut feeling was that this would involve >50% of > the ops, but Steve Grubb seems to think it would be less; it may be > time to look at that a bit more seriously, but if it gets a NACK > regardless it isn't worth the time - thoughts? > > case IORING_OP_OPENAT: > audit_uring_entry(req->opcode); > ret = io_openat(req, issue_flags); > audit_uring_exit(!ret, ret); > break; I wanted to pose this question again in case it was lost in the thread, I suspect this may be the last option before we have to "fix" things at the Kconfig level. I definitely don't want to have to go that route, and I suspect most everyone on this thread feels the same, so I'm hopeful we can find a solution that is begrudgingly acceptable to both groups.
On 5/28/21 5:02 PM, Paul Moore wrote: > On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote: >> ... If we moved the _entry >> and _exit calls into the individual operation case blocks (quick >> openat example below) so that only certain operations were able to be >> audited would that be acceptable assuming the high frequency ops were >> untouched? My initial gut feeling was that this would involve >50% of >> the ops, but Steve Grubb seems to think it would be less; it may be >> time to look at that a bit more seriously, but if it gets a NACK >> regardless it isn't worth the time - thoughts? >> >> case IORING_OP_OPENAT: >> audit_uring_entry(req->opcode); >> ret = io_openat(req, issue_flags); >> audit_uring_exit(!ret, ret); >> break; > > I wanted to pose this question again in case it was lost in the > thread, I suspect this may be the last option before we have to "fix" > things at the Kconfig level. I definitely don't want to have to go > that route, and I suspect most everyone on this thread feels the same, > so I'm hopeful we can find a solution that is begrudgingly acceptable > to both groups. May work for me, but have to ask how many, and what are the criteria? I'd think anything opening a file or manipulating fs: IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2], IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN, IORING_OP_FILES_UPDATE + coming mkdirat and others. IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV? What about? IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE, IORING_OP_FALLOCATE, IORING_OP_STATX, IORING_OP_FADVISE, IORING_OP_MADVISE, IORING_OP_EPOLL_CTL Another question, io_uring may exercise asynchronous paths, i.e. io_issue_sqe() returns before the request completes. Shouldn't be the case for open/etc at the moment, but was that considered? I don't see it happening, but would prefer to keep it open for an async reimplementation in a distant future. Does audit sleep?
On 2021-06-02 09:26, Pavel Begunkov wrote: > On 5/28/21 5:02 PM, Paul Moore wrote: > > On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote: > >> ... If we moved the _entry > >> and _exit calls into the individual operation case blocks (quick > >> openat example below) so that only certain operations were able to be > >> audited would that be acceptable assuming the high frequency ops were > >> untouched? My initial gut feeling was that this would involve >50% of > >> the ops, but Steve Grubb seems to think it would be less; it may be > >> time to look at that a bit more seriously, but if it gets a NACK > >> regardless it isn't worth the time - thoughts? > >> > >> case IORING_OP_OPENAT: > >> audit_uring_entry(req->opcode); > >> ret = io_openat(req, issue_flags); > >> audit_uring_exit(!ret, ret); > >> break; > > > > I wanted to pose this question again in case it was lost in the > > thread, I suspect this may be the last option before we have to "fix" > > things at the Kconfig level. I definitely don't want to have to go > > that route, and I suspect most everyone on this thread feels the same, > > so I'm hopeful we can find a solution that is begrudgingly acceptable > > to both groups. > > May work for me, but have to ask how many, and what is the > criteria? I'd think anything opening a file or manipulating fs: > > IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2], > IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN, > IORING_OP_FILES_UPDATE > + coming mkdirat and others. > > IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV? > > What about? > IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE, > IORING_OP_FALLOCATE, IORING_OP_STATX, > IORING_OP_FADVISE, IORING_OP_MADVISE, > IORING_OP_EPOLL_CTL > > > Another question, io_uring may exercise asynchronous paths, > i.e. io_issue_sqe() returns before requests completes. > Shouldn't be the case for open/etc at the moment, but was that > considered? This would be why audit needs to monitor a thread until it wraps up, to wait for the result code. My understanding is that both sync and async parts of an op would be monitored. > I don't see it happening, but would prefer to keep it open > async reimplementation in a distant future. Does audit sleep? Some parts do, some parts don't depending on what they are interacting with in the kernel. It can be made to not sleep if needed. > Pavel Begunkov - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On 2021-05-21 17:49, Paul Moore wrote: > WARNING - This is a work in progress and should not be merged > anywhere important. It is almost surely not complete, and while it > probably compiles it likely hasn't been booted and will do terrible > things. You have been warned. > > This patch adds basic auditing to io_uring operations, regardless of > their context. This is accomplished by allocating audit_context > structures for the io-wq worker and io_uring SQPOLL kernel threads > as well as explicitly auditing the io_uring operations in > io_issue_sqe(). The io_uring operations are audited using a new > AUDIT_URINGOP record, an example is shown below: > > % <TODO - insert AUDIT_URINGOP record example> > > Thanks to Richard Guy Briggs for review and feedback. > > Signed-off-by: Paul Moore <paul@paul-moore.com> > --- > fs/io-wq.c | 4 + > fs/io_uring.c | 11 +++ > include/linux/audit.h | 17 ++++ > include/uapi/linux/audit.h | 1 > kernel/audit.h | 2 + > kernel/auditsc.c | 173 ++++++++++++++++++++++++++++++++++++++++++++ > 6 files changed, 208 insertions(+) > > diff --git a/fs/io-wq.c b/fs/io-wq.c > index 5361a9b4b47b..8af09a3336e0 100644 > --- a/fs/io-wq.c > +++ b/fs/io-wq.c > @@ -16,6 +16,7 @@ > #include <linux/rculist_nulls.h> > #include <linux/cpu.h> > #include <linux/tracehook.h> > +#include <linux/audit.h> > > #include "io-wq.h" > > @@ -535,6 +536,8 @@ static int io_wqe_worker(void *data) > snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid); > set_task_comm(current, buf); > > + audit_alloc_kernel(current); > + > while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) { > long ret; > > @@ -573,6 +576,7 @@ static int io_wqe_worker(void *data) > raw_spin_unlock_irq(&wqe->lock); > } > > + audit_free(current); > io_worker_exit(worker); > return 0; > } > diff --git a/fs/io_uring.c b/fs/io_uring.c > index e481ac8a757a..e9941d1ad8fd 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -78,6 +78,7 @@ > #include <linux/task_work.h> > #include <linux/pagemap.h> > #include <linux/io_uring.h> > +#include <linux/audit.h> > > #define CREATE_TRACE_POINTS > #include <trace/events/io_uring.h> > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) > if (req->work.creds && req->work.creds != current_cred()) > creds = override_creds(req->work.creds); > > + if (req->opcode < IORING_OP_LAST) > + audit_uring_entry(req->opcode); > + > switch (req->opcode) { > case IORING_OP_NOP: > ret = io_nop(req, issue_flags); > @@ -6211,6 +6215,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) > break; > } > > + if (req->opcode < IORING_OP_LAST) > + audit_uring_exit(!ret, ret); > + > if (creds) > revert_creds(creds); > > @@ -6827,6 +6834,8 @@ static int io_sq_thread(void *data) > set_cpus_allowed_ptr(current, cpu_online_mask); > current->flags |= PF_NO_SETAFFINITY; > > + audit_alloc_kernel(current); > + > mutex_lock(&sqd->lock); > /* a user may had exited before the thread started */ > io_run_task_work_head(&sqd->park_task_work); > @@ -6916,6 +6925,8 @@ static int io_sq_thread(void *data) > io_run_task_work_head(&sqd->park_task_work); > mutex_unlock(&sqd->lock); > > + audit_free(current); > + > complete(&sqd->exited); > do_exit(0); > } > diff --git a/include/linux/audit.h b/include/linux/audit.h > index 82b7c1116a85..6a0c013bc7de 100644 > --- a/include/linux/audit.h > +++ b/include/linux/audit.h > @@ -286,7 +286,10 @@ static inline int audit_signal_info(int sig, struct task_struct *t) > /* These are defined in auditsc.c */ > /* Public API */ > extern int 
audit_alloc(struct task_struct *task); > +extern int audit_alloc_kernel(struct task_struct *task); > extern void __audit_free(struct task_struct *task); > +extern void __audit_uring_entry(u8 op); > +extern void __audit_uring_exit(int success, long code); > extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1, > unsigned long a2, unsigned long a3); > extern void __audit_syscall_exit(int ret_success, long ret_value); > @@ -323,6 +326,16 @@ static inline void audit_free(struct task_struct *task) > if (unlikely(task->audit_context)) > __audit_free(task); > } > +static inline void audit_uring_entry(u8 op) > +{ > + if (unlikely(audit_context())) > + __audit_uring_entry(op); > +} > +static inline void audit_uring_exit(int success, long code) > +{ > + if (unlikely(audit_context())) > + __audit_uring_exit(success, code); > +} > static inline void audit_syscall_entry(int major, unsigned long a0, > unsigned long a1, unsigned long a2, > unsigned long a3) > @@ -554,6 +567,10 @@ static inline int audit_alloc(struct task_struct *task) > { > return 0; > } > +static inline int audit_alloc_kernel(struct task_struct *task) > +{ > + return 0; > +} > static inline void audit_free(struct task_struct *task) > { } > static inline void audit_syscall_entry(int major, unsigned long a0, > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h > index cd2d8279a5e4..b26e0c435e8b 100644 > --- a/include/uapi/linux/audit.h > +++ b/include/uapi/linux/audit.h > @@ -118,6 +118,7 @@ > #define AUDIT_TIME_ADJNTPVAL 1333 /* NTP value adjustment */ > #define AUDIT_BPF 1334 /* BPF subsystem */ > #define AUDIT_EVENT_LISTENER 1335 /* Task joined multicast read socket */ > +#define AUDIT_URINGOP 1336 /* io_uring operation */ > > #define AUDIT_AVC 1400 /* SE Linux avc denial or grant */ > #define AUDIT_SELINUX_ERR 1401 /* Internal SE Linux Errors */ > diff --git a/kernel/audit.h b/kernel/audit.h > index fba180de5912..50de827497ca 100644 > --- a/kernel/audit.h > +++ b/kernel/audit.h > @@ -100,10 +100,12 @@ struct audit_context { > enum { > AUDIT_CTX_UNUSED, /* audit_context is currently unused */ > AUDIT_CTX_SYSCALL, /* in use by syscall */ > + AUDIT_CTX_URING, /* in use by io_uring */ > } context; > enum audit_state state, current_state; > unsigned int serial; /* serial number for record */ > int major; /* syscall number */ > + int uring_op; /* uring operation */ > struct timespec64 ctime; /* time of syscall entry */ > unsigned long argv[4]; /* syscall arguments */ > long return_code;/* syscall return code */ > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > index cc89e9f9a753..729849d41631 100644 > --- a/kernel/auditsc.c > +++ b/kernel/auditsc.c > @@ -953,6 +953,7 @@ static void audit_reset_context(struct audit_context *ctx) > ctx->current_state = ctx->state; > ctx->serial = 0; > ctx->major = 0; > + ctx->uring_op = 0; > ctx->ctime = (struct timespec64){ .tv_sec = 0, .tv_nsec = 0 }; > memset(ctx->argv, 0, sizeof(ctx->argv)); > ctx->return_code = 0; > @@ -1038,6 +1039,31 @@ int audit_alloc(struct task_struct *tsk) > return 0; > } > > +/** > + * audit_alloc_kernel - allocate an audit_context for a kernel task > + * @tsk: the kernel task > + * > + * Similar to the audit_alloc() function, but intended for kernel private > + * threads. Returns zero on success, negative values on failure. 
> + */ > +int audit_alloc_kernel(struct task_struct *tsk) > +{ > + /* > + * At the moment we are just going to call into audit_alloc() to > + * simplify the code, but there two things to keep in mind with this > + * approach: > + * > + * 1. Filtering internal kernel tasks is a bit laughable in almost all > + * cases, but there is at least one case where there is a benefit: > + * the '-a task,never' case allows the admin to effectively disable > + * task auditing at runtime. > + * > + * 2. The {set,clear}_task_syscall_work() ops likely have zero effect > + * on these internal kernel tasks, but they probably don't hurt either. > + */ > + return audit_alloc(tsk); > +} > + > static inline void audit_free_context(struct audit_context *context) > { > /* resetting is extra work, but it is likely just noise */ > @@ -1536,6 +1562,52 @@ static void audit_log_proctitle(void) > audit_log_end(ab); > } > > +/** > + * audit_log_uring - generate a AUDIT_URINGOP record > + * @ctx: the audit context > + */ > +static void audit_log_uring(struct audit_context *ctx) > +{ > + struct audit_buffer *ab; > + const struct cred *cred; > + > + /* > + * TODO: What do we log here? I'm tossing in a few things to start the > + * conversation, but additional thought needs to go into this. > + */ > + > + ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP); > + if (!ab) > + return; > + cred = current_cred(); This may need to be req->work.creds. I haven't been following if the io_uring thread inherited the user task's creds (and below, comm and exe). > + audit_log_format(ab, "uring_op=%d", ctx->uring_op); arch is stored below in __audit_uring_entry() and never used in the AUDIT_CTX_URING case. That assignment can either be dropped or printed before uring_op similar to the SYSCALL record. There aren't really any arg[0-3] to print. io_uring_register and io_uring_setup() args are better covered by other records. io_uring_enter() has 6 args and the last two aren't covered by SYSCALL anyways. > + if (ctx->return_valid != AUDITSC_INVALID) > + audit_log_format(ab, " success=%s exit=%ld", > + (ctx->return_valid == AUDITSC_SUCCESS ? > + "yes" : "no"), > + ctx->return_code); > + audit_log_format(ab, > + " items=%d" > + " ppid=%d pid=%d auid=%u uid=%u gid=%u" > + " euid=%u suid=%u fsuid=%u" > + " egid=%u sgid=%u fsgid=%u", > + ctx->name_count, > + task_ppid_nr(current), > + task_tgid_nr(current), > + from_kuid(&init_user_ns, audit_get_loginuid(current)), > + from_kuid(&init_user_ns, cred->uid), > + from_kgid(&init_user_ns, cred->gid), > + from_kuid(&init_user_ns, cred->euid), > + from_kuid(&init_user_ns, cred->suid), > + from_kuid(&init_user_ns, cred->fsuid), > + from_kgid(&init_user_ns, cred->egid), > + from_kgid(&init_user_ns, cred->sgid), > + from_kgid(&init_user_ns, cred->fsgid)); The audit session ID is still important, relevant and qualifies auid. In keeping with the SYSCALL record format, I think we want to keep ses=audit_get_sessionid(current) in here. I'm pretty sure we also want to keep comm= and exe= too, but may have to reach into req->task to get it. There are two values for comm possible, one from the original task and second "iou-sqp-<pid>" set at the top of io_sq_thread(). I'm reluctant to leave them out now and then have to re-add them in yet another field order later. 
> + audit_log_task_context(ab); > + audit_log_key(ab, ctx->filterkey); > + audit_log_end(ab); > +} > + > static void audit_log_exit(void) > { > int i, call_panic = 0; > @@ -1571,6 +1643,9 @@ static void audit_log_exit(void) > audit_log_key(ab, context->filterkey); > audit_log_end(ab); > break; > + case AUDIT_CTX_URING: > + audit_log_uring(context); > + break; > default: > BUG(); > break; > @@ -1740,6 +1815,104 @@ static void audit_return_fixup(struct audit_context *ctx, > ctx->return_valid = (success ? AUDITSC_SUCCESS : AUDITSC_FAILURE); > } > > +/** > + * __audit_uring_entry - prepare the kernel task's audit context for io_uring > + * @op: the io_uring opcode > + * > + * This is similar to audit_syscall_entry() but is intended for use by io_uring > + * operations. > + */ > +void __audit_uring_entry(u8 op) > +{ > + struct audit_context *ctx = audit_context(); > + > + if (!audit_enabled || !ctx || ctx->state == AUDIT_DISABLED) > + return; > + > + /* > + * NOTE: It's possible that we can be called from the process' context > + * before it returns to userspace, and before audit_syscall_exit() > + * is called. In this case there is not much to do, just record > + * the io_uring details and return. > + */ > + ctx->uring_op = op; > + if (ctx->context == AUDIT_CTX_SYSCALL) > + return; > + > + ctx->dummy = !audit_n_rules; > + if (!ctx->dummy && ctx->state == AUDIT_BUILD_CONTEXT) > + ctx->prio = 0; > + > + ctx->arch = syscall_get_arch(current); > + ctx->context = AUDIT_CTX_URING; > + ctx->current_state = ctx->state; > + ktime_get_coarse_real_ts64(&ctx->ctime); > +} > + > +/** > + * __audit_uring_exit - wrap up the kernel task's audit context after io_uring > + * @success: true/false value to indicate if the operation succeeded or not > + * @code: operation return code > + * > + * This is similar to audit_syscall_exit() but is intended for use by io_uring > + * operations. > + */ > +void __audit_uring_exit(int success, long code) > +{ > + struct audit_context *ctx = audit_context(); > + > + /* > + * TODO: At some point we will likely want to filter on io_uring ops > + * and other things similar to what we do for syscalls, but that > + * is something for another day; just record what we can here. > + */ > + > + if (!ctx || ctx->dummy) > + goto out; > + if (ctx->context == AUDIT_CTX_SYSCALL) { > + /* > + * NOTE: See the note in __audit_uring_entry() about the case > + * where we may be called from process context before we > + * return to userspace via audit_syscall_exit(). In this > + * case we simply emit a URINGOP record and bail, the > + * normal syscall exit handling will take care of > + * everything else. > + * It is also worth mentioning that when we are called, > + * the current process creds may differ from the creds > + * used during the normal syscall processing; keep that > + * in mind if/when we move the record generation code. > + */ > + > + /* > + * We need to filter on the syscall info here to decide if we > + * should emit a URINGOP record. I know it seems odd but this > + * solves the problem where users have a filter to block *all* > + * syscall records in the "exit" filter; we want to preserve > + * the behavior here. 
> + */ > + audit_filter_syscall(current, ctx); > + audit_filter_inodes(current, ctx); > + if (ctx->current_state != AUDIT_RECORD_CONTEXT) > + return; > + > + audit_log_uring(ctx); > + return; > + } > + > + /* this may generate CONFIG_CHANGE records */ > + if (!list_empty(&ctx->killed_trees)) > + audit_kill_trees(ctx); > + > + audit_filter_inodes(current, ctx); > + if (ctx->current_state != AUDIT_RECORD_CONTEXT) > + goto out; > + audit_return_fixup(ctx, success, code); > + audit_log_exit(); > + > +out: > + audit_reset_context(ctx); > +} > + > /** > * __audit_syscall_entry - fill in an audit record at syscall entry > * @major: major syscall type (function) > > -- > Linux-audit mailing list > Linux-audit@redhat.com > https://listman.redhat.com/mailman/listinfo/linux-audit - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
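As a purely illustrative sketch of the session-ID part of the review comment above (Paul pushes back on parts of this further down the thread), the addition to audit_log_uring() would be on the order of:

    /* reviewer's suggestion, sketched: carry the session ID the way the
     * SYSCALL record does; exact placement in the field order is open */
    audit_log_format(ab, " ses=%u", audit_get_sessionid(current));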
On Wed, Jun 2, 2021 at 4:27 AM Pavel Begunkov <asml.silence@gmail.com> wrote: > On 5/28/21 5:02 PM, Paul Moore wrote: > > On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote: > >> ... If we moved the _entry > >> and _exit calls into the individual operation case blocks (quick > >> openat example below) so that only certain operations were able to be > >> audited would that be acceptable assuming the high frequency ops were > >> untouched? My initial gut feeling was that this would involve >50% of > >> the ops, but Steve Grubb seems to think it would be less; it may be > >> time to look at that a bit more seriously, but if it gets a NACK > >> regardless it isn't worth the time - thoughts? > >> > >> case IORING_OP_OPENAT: > >> audit_uring_entry(req->opcode); > >> ret = io_openat(req, issue_flags); > >> audit_uring_exit(!ret, ret); > >> break; > > > > I wanted to pose this question again in case it was lost in the > > thread, I suspect this may be the last option before we have to "fix" > > things at the Kconfig level. I definitely don't want to have to go > > that route, and I suspect most everyone on this thread feels the same, > > so I'm hopeful we can find a solution that is begrudgingly acceptable > > to both groups. > > May work for me, but have to ask how many, and what is the > criteria? I'd think anything opening a file or manipulating fs: > > IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2], > IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN, > IORING_OP_FILES_UPDATE > + coming mkdirat and others. > > IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV? > > What about? > IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE, > IORING_OP_FALLOCATE, IORING_OP_STATX, > IORING_OP_FADVISE, IORING_OP_MADVISE, > IORING_OP_EPOLL_CTL Looking quickly at v5.13-rc4 the following seems like candidates for auditing, there may be a small number of subtractions/additions to this list as people take a closer look, but it should serve as a starting point: IORING_OP_SENDMSG IORING_OP_RECVMSG IORING_OP_ACCEPT IORING_OP_CONNECT IORING_OP_FALLOCATE IORING_OP_OPENAT IORING_OP_CLOSE IORING_OP_MADVISE IORING_OP_OPENAT2 IORING_OP_SHUTDOWN IORING_OP_RENAMEAT IORING_OP_UNLINKAT ... can you live with that list? > Another question, io_uring may exercise asynchronous paths, > i.e. io_issue_sqe() returns before requests completes. > Shouldn't be the case for open/etc at the moment, but was that > considered? Yes, I noticed that when testing the code (and it makes sense when you look at how io_uring handles things). Depending on the state of the system when the io_uring request is submitted I've seen both sync and async io_uring operations with the associated different calling contexts. In the case where io_issue_sqe() needs to defer the operation to a different context you will see an audit record indicating that the operation failed and then another audit record when it completes; it's actually pretty interesting to be able to see how the system and io_uring are working. We could always mask out these delayed attempts, but at this early stage they are helpful, and they may be useful for admins. > I don't see it happening, but would prefer to keep it open > async reimplementation in a distant future. Does audit sleep? The only place in the audit_uring_entry()/audit_uring_exit() code path that could sleep at present is the call to audit_log_uring() which is made when the rules dictate that an audit record be generated. 
The offending code is an allocation in audit_log_uring() which is currently GFP_KERNEL but really should be GFP_ATOMIC, or similar. It was a copy-n-paste from the similar syscall function where GFP_KERNEL is appropriate due to the calling context at the end of the syscall. I'll change that as soon as I'm done with this email. Of course if you are calling io_uring_enter(2), or something similar, then audit may sleep as part of the normal syscall processing (as mentioned above), but that is due to the fact that io_uring_enter(2) is a syscall and not because of anything in io_issue_sqe().
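Concretely, the fix Paul describes above is a one-flag change at the audit_log_start() call in audit_log_uring(); a minimal sketch, assuming GFP_ATOMIC ends up being the flag of choice:

    /* allocation must not sleep: this can be reached from a non-sleepable
     * io_uring issue path, unlike the syscall-exit path it was copied from */
    ab = audit_log_start(ctx, GFP_ATOMIC, AUDIT_URINGOP);
    if (!ab)
            return;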
On Wed, Jun 2, 2021 at 1:29 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2021-05-21 17:49, Paul Moore wrote: > > WARNING - This is a work in progress and should not be merged > > anywhere important. It is almost surely not complete, and while it > > probably compiles it likely hasn't been booted and will do terrible > > things. You have been warned. > > > > This patch adds basic auditing to io_uring operations, regardless of > > their context. This is accomplished by allocating audit_context > > structures for the io-wq worker and io_uring SQPOLL kernel threads > > as well as explicitly auditing the io_uring operations in > > io_issue_sqe(). The io_uring operations are audited using a new > > AUDIT_URINGOP record, an example is shown below: > > > > % <TODO - insert AUDIT_URINGOP record example> > > > > Thanks to Richard Guy Briggs for review and feedback. > > > > Signed-off-by: Paul Moore <paul@paul-moore.com> > > --- > > fs/io-wq.c | 4 + > > fs/io_uring.c | 11 +++ > > include/linux/audit.h | 17 ++++ > > include/uapi/linux/audit.h | 1 > > kernel/audit.h | 2 + > > kernel/auditsc.c | 173 ++++++++++++++++++++++++++++++++++++++++++++ > > 6 files changed, 208 insertions(+) ... > > diff --git a/fs/io_uring.c b/fs/io_uring.c > > index e481ac8a757a..e9941d1ad8fd 100644 > > --- a/fs/io_uring.c > > +++ b/fs/io_uring.c > > @@ -78,6 +78,7 @@ > > #include <linux/task_work.h> > > #include <linux/pagemap.h> > > #include <linux/io_uring.h> > > +#include <linux/audit.h> > > > > #define CREATE_TRACE_POINTS > > #include <trace/events/io_uring.h> > > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) > > if (req->work.creds && req->work.creds != current_cred()) > > creds = override_creds(req->work.creds); > > > > + if (req->opcode < IORING_OP_LAST) > > + audit_uring_entry(req->opcode); Note well the override_creds() call right above the audit code that is being added, it will be important later in this email. > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > > index cc89e9f9a753..729849d41631 100644 > > --- a/kernel/auditsc.c > > +++ b/kernel/auditsc.c > > @@ -1536,6 +1562,52 @@ static void audit_log_proctitle(void) > > audit_log_end(ab); > > } > > > > +/** > > + * audit_log_uring - generate a AUDIT_URINGOP record > > + * @ctx: the audit context > > + */ > > +static void audit_log_uring(struct audit_context *ctx) > > +{ > > + struct audit_buffer *ab; > > + const struct cred *cred; > > + > > + /* > > + * TODO: What do we log here? I'm tossing in a few things to start the > > + * conversation, but additional thought needs to go into this. > > + */ > > + > > + ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP); > > + if (!ab) > > + return; > > + cred = current_cred(); > > This may need to be req->work.creds. I haven't been following if the > io_uring thread inherited the user task's creds (and below, comm and > exe). Nope, we're good. See the existing code in io_issue_sqe() :) > > + audit_log_format(ab, "uring_op=%d", ctx->uring_op); > > arch is stored below in __audit_uring_entry() and never used in the > AUDIT_CTX_URING case. That assignment can either be dropped or printed > before uring_op similar to the SYSCALL record. Good point, I'll drop the code that records the arch from _entry(); it is really only useful to give the appropriate context if needed for other things in the audit stream, and that isn't the case like it is with syscalls. > There aren't really any arg[0-3] to print. Which is why I didn't print them. 
> io_uring_register and io_uring_setup() args are better covered by other > records. io_uring_enter() has 6 args and the last two aren't covered by > SYSCALL anyways. ??? I think you are confusing the io_uring ops with syscalls; they are very different things from an audit perspective and the io_uring auditing is not intended to replace syscall records. The io_uring_setup() and io_uring_enter() syscalls will be audited just as any other syscalls would be, using the existing syscall audit code. > > + if (ctx->return_valid != AUDITSC_INVALID) > > + audit_log_format(ab, " success=%s exit=%ld", > > + (ctx->return_valid == AUDITSC_SUCCESS ? > > + "yes" : "no"), > > + ctx->return_code); > > + audit_log_format(ab, > > + " items=%d" > > + " ppid=%d pid=%d auid=%u uid=%u gid=%u" > > + " euid=%u suid=%u fsuid=%u" > > + " egid=%u sgid=%u fsgid=%u", > > + ctx->name_count, > > + task_ppid_nr(current), > > + task_tgid_nr(current), > > + from_kuid(&init_user_ns, audit_get_loginuid(current)), > > + from_kuid(&init_user_ns, cred->uid), > > + from_kgid(&init_user_ns, cred->gid), > > + from_kuid(&init_user_ns, cred->euid), > > + from_kuid(&init_user_ns, cred->suid), > > + from_kuid(&init_user_ns, cred->fsuid), > > + from_kgid(&init_user_ns, cred->egid), > > + from_kgid(&init_user_ns, cred->sgid), > > + from_kgid(&init_user_ns, cred->fsgid)); > > The audit session ID is still important, relevant and qualifies auid. > In keeping with the SYSCALL record format, I think we want to keep > ses=audit_get_sessionid(current) in here. This might be another case of syscall/io_uring confusion. An io_uring op doesn't necessarily have an audit session ID or an audit UID in the conventional sense; for example think about how SQPOLL works, shared rings, etc. > I'm pretty sure we also want to keep comm= and exe= too, but may have to > reach into req->task to get it. There are two values for comm possible, > one from the original task and second "iou-sqp-<pid>" set at the top of > io_sq_thread(). I think this is more syscall/io_uring confusion.
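For readers skimming the thread, this is the ordering from the posted io_issue_sqe() hunk that Paul is pointing at; the added comments spell out why current_cred() already reflects the request's credentials when the URINGOP record is built:

    if (req->work.creds && req->work.creds != current_cred())
            creds = override_creds(req->work.creds);

    /* from here until revert_creds(), current_cred() returns req->work.creds,
     * so a URINGOP record built from current_cred() shows the request's
     * credentials rather than the io-wq worker's or SQPOLL thread's own */
    if (req->opcode < IORING_OP_LAST)
            audit_uring_entry(req->opcode);

    /* ... opcode dispatch, then audit_uring_exit(!ret, ret) ... */

    if (creds)
            revert_creds(creds);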
On 6/2/21 4:46 PM, Richard Guy Briggs wrote: > On 2021-06-02 09:26, Pavel Begunkov wrote: >> On 5/28/21 5:02 PM, Paul Moore wrote: >>> On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote: >>>> ... If we moved the _entry >>>> and _exit calls into the individual operation case blocks (quick >>>> openat example below) so that only certain operations were able to be >>>> audited would that be acceptable assuming the high frequency ops were >>>> untouched? My initial gut feeling was that this would involve >50% of >>>> the ops, but Steve Grubb seems to think it would be less; it may be >>>> time to look at that a bit more seriously, but if it gets a NACK >>>> regardless it isn't worth the time - thoughts? >>>> >>>> case IORING_OP_OPENAT: >>>> audit_uring_entry(req->opcode); >>>> ret = io_openat(req, issue_flags); >>>> audit_uring_exit(!ret, ret); >>>> break; >>> >>> I wanted to pose this question again in case it was lost in the >>> thread, I suspect this may be the last option before we have to "fix" >>> things at the Kconfig level. I definitely don't want to have to go >>> that route, and I suspect most everyone on this thread feels the same, >>> so I'm hopeful we can find a solution that is begrudgingly acceptable >>> to both groups. >> >> May work for me, but have to ask how many, and what is the >> criteria? I'd think anything opening a file or manipulating fs: >> >> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2], >> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN, >> IORING_OP_FILES_UPDATE >> + coming mkdirat and others. >> >> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV? >> >> What about? >> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE, >> IORING_OP_FALLOCATE, IORING_OP_STATX, >> IORING_OP_FADVISE, IORING_OP_MADVISE, >> IORING_OP_EPOLL_CTL >> >> >> Another question, io_uring may exercise asynchronous paths, >> i.e. io_issue_sqe() returns before requests completes. >> Shouldn't be the case for open/etc at the moment, but was that >> considered? > > This would be why audit needs to monitor a thread until it wraps up, to > wait for the result code. My understanding is that both sync and async > parts of an op would be monitored. There may be a misunderstanding audit_start(req) ret = io_issue_sqe(req); audit_end(ret); io_issue_sqe() may return 0 but leave the request inflight, which will be completed asynchronously e.g. by IRQ, not going through io_issue_sqe() or any io_read()/etc helpers again, and after last audit_end() had already happened. That's the case with read/write/timeout, but is not true for open/etc. >> I don't see it happening, but would prefer to keep it open >> async reimplementation in a distant future. Does audit sleep? > > Some parts do, some parts don't depending on what they are interacting > with in the kernel. It can be made to not sleep if needed. Ok, good
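A concrete instance of the timing Pavel describes, using the v5.13 read path purely as an illustration (function names are from fs/io_uring.c of that era; nothing here is proposed code):

    case IORING_OP_READV:
            ret = io_read(req, issue_flags);
            /* ret == 0 here can mean "submitted", not "done": for async or
             * polled I/O the request stays inflight and completes later via
             * the kiocb completion callbacks (io_complete_rw() or
             * io_complete_rw_iopoll(), possibly from IRQ context), long
             * after audit_uring_exit() has run for this request */
            break;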
On 6/2/21 8:46 PM, Paul Moore wrote: > On Wed, Jun 2, 2021 at 4:27 AM Pavel Begunkov <asml.silence@gmail.com> wrote: >> On 5/28/21 5:02 PM, Paul Moore wrote: >>> On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote: >>>> ... If we moved the _entry >>>> and _exit calls into the individual operation case blocks (quick >>>> openat example below) so that only certain operations were able to be >>>> audited would that be acceptable assuming the high frequency ops were >>>> untouched? My initial gut feeling was that this would involve >50% of >>>> the ops, but Steve Grubb seems to think it would be less; it may be >>>> time to look at that a bit more seriously, but if it gets a NACK >>>> regardless it isn't worth the time - thoughts? >>>> >>>> case IORING_OP_OPENAT: >>>> audit_uring_entry(req->opcode); >>>> ret = io_openat(req, issue_flags); >>>> audit_uring_exit(!ret, ret); >>>> break; >>> >>> I wanted to pose this question again in case it was lost in the >>> thread, I suspect this may be the last option before we have to "fix" >>> things at the Kconfig level. I definitely don't want to have to go >>> that route, and I suspect most everyone on this thread feels the same, >>> so I'm hopeful we can find a solution that is begrudgingly acceptable >>> to both groups. >> >> May work for me, but have to ask how many, and what is the >> criteria? I'd think anything opening a file or manipulating fs: >> >> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2], >> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN, >> IORING_OP_FILES_UPDATE >> + coming mkdirat and others. >> >> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV? >> >> What about? >> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE, >> IORING_OP_FALLOCATE, IORING_OP_STATX, >> IORING_OP_FADVISE, IORING_OP_MADVISE, >> IORING_OP_EPOLL_CTL > > Looking quickly at v5.13-rc4 the following seems like candidates for > auditing, there may be a small number of subtractions/additions to > this list as people take a closer look, but it should serve as a > starting point: > > IORING_OP_SENDMSG > IORING_OP_RECVMSG > IORING_OP_ACCEPT > IORING_OP_CONNECT > IORING_OP_FALLOCATE > IORING_OP_OPENAT > IORING_OP_CLOSE > IORING_OP_MADVISE > IORING_OP_OPENAT2 > IORING_OP_SHUTDOWN > IORING_OP_RENAMEAT > IORING_OP_UNLINKAT > > ... can you live with that list? it will bloat binary somewhat, but considering it's all in one place -- io_issue_sqe(), it's workable. Not nice to have send/recv msg in the list, but I admit they may do some crazy things. What can be traced for them? Because at the moment of issue_sqe() not everything is read from the userspace. see: io_sendmsg() { ...; io_sendmsg_copy_hdr(); }, will copy header only in io_sendmsg() in most cases, and then stash it for re-issuing if needed. >> Another question, io_uring may exercise asynchronous paths, >> i.e. io_issue_sqe() returns before requests completes. >> Shouldn't be the case for open/etc at the moment, but was that >> considered? > > Yes, I noticed that when testing the code (and it makes sense when you > look at how io_uring handles things). Depending on the state of the > system when the io_uring request is submitted I've seen both sync and > async io_uring operations with the associated different calling > contexts. 
In the case where io_issue_sqe() needs to defer the > operation to a different context you will see an audit record > indicating that the operation failed and then another audit record > when it completes; it's actually pretty interesting to be able to see > how the system and io_uring are working. Copying a reply to another message to keep clear out of misunderstanding. "io_issue_sqe() may return 0 but leave the request inflight, which will be completed asynchronously e.g. by IRQ, not going through io_issue_sqe() or any io_read()/etc helpers again, and after last audit_end() had already happened. That's the case with read/write/timeout, but is not true for open/etc." And there is interest in async send/recv[msg] as well (via IRQ as described, callbacks, etc.). > We could always mask out these delayed attempts, but at this early > stage they are helpful, and they may be useful for admins. > >> I don't see it happening, but would prefer to keep it open >> async reimplementation in a distant future. Does audit sleep? > > The only place in the audit_uring_entry()/audit_uring_exit() code path > that could sleep at present is the call to audit_log_uring() which is > made when the rules dictate that an audit record be generated. The > offending code is an allocation in audit_log_uring() which is > currently GFP_KERNEL but really should be GFP_ATOMIC, or similar. It > was a copy-n-paste from the similar syscall function where GFP_KERNEL > is appropriate due to the calling context at the end of the syscall. > I'll change that as soon as I'm done with this email. Ok, depends where it steers, but there may be a requirement to not sleep for some hooks because of not having a sleepable context. > > Of course if you are calling io_uring_enter(2), or something similar, > then audit may sleep as part of the normal syscall processing (as > mentioned above), but that is due to the fact that io_uring_enter(2) > is a syscall and not because of anything in io_issue_sqe(). >
On 5/28/21 10:02 AM, Paul Moore wrote: > On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote: >> ... If we moved the _entry >> and _exit calls into the individual operation case blocks (quick >> openat example below) so that only certain operations were able to be >> audited would that be acceptable assuming the high frequency ops were >> untouched? My initial gut feeling was that this would involve >50% of >> the ops, but Steve Grubb seems to think it would be less; it may be >> time to look at that a bit more seriously, but if it gets a NACK >> regardless it isn't worth the time - thoughts? >> >> case IORING_OP_OPENAT: >> audit_uring_entry(req->opcode); >> ret = io_openat(req, issue_flags); >> audit_uring_exit(!ret, ret); >> break; > > I wanted to pose this question again in case it was lost in the > thread, I suspect this may be the last option before we have to "fix" > things at the Kconfig level. I definitely don't want to have to go > that route, and I suspect most everyone on this thread feels the same, > so I'm hopeful we can find a solution that is begrudgingly acceptable > to both groups. Sorry for the lack of response here, but to sum up my order of preference: 1) It's probably better to just make the audit an opt-out in io_op_defs for each opcode, and avoid needing boilerplate code for each op handler. The opt-out would ensure that new opcodes get it by default if someone doesn't know what it is, and the io_op_defs addition would mean that it's in generic code rather than in the handlers. Yes it's a bit slower, but it's saner imho. 2) With the above, I'm fine with adding this to io_uring. I don't think going the route of mutual exclusion in kconfig helps anyone, it'd be counterproductive to both sides. Hope that works and helps move this forward. I'll be mostly out of touch the next week and a half, but wanted to ensure that I sent out my (brief) thoughts before going away.
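As a rough sketch of the opt-out Jens describes, with the caveat that the audit_skip field name is hypothetical and the existing io_op_def fields are elided, this is the shape of the idea rather than a finished patch:

    struct io_op_def {
            /* ... existing per-opcode flags elided ... */
            /* hypothetical opt-out bit: set to skip auditing for this op */
            unsigned                audit_skip : 1;
    };

    static const struct io_op_def io_op_defs[] = {
            [IORING_OP_NOP] = {
                    .audit_skip     = 1,    /* hot path, opted out */
            },
            [IORING_OP_READV] = {
                    .audit_skip     = 1,    /* high-frequency I/O, opted out */
                    /* ... existing fields ... */
            },
            [IORING_OP_OPENAT] = {
                    /* no .audit_skip: security-relevant and new opcodes are
                     * audited unless their author explicitly opts out */
                    /* ... existing fields ... */
            },
            /* ... */
    };

    /* in io_issue_sqe(), replacing the unconditional calls from the RFC */
    if (!io_op_defs[req->opcode].audit_skip)
            audit_uring_entry(req->opcode);
    /* ... opcode switch ... */
    if (!io_op_defs[req->opcode].audit_skip)
            audit_uring_exit(!ret, ret);

The attraction over per-case boilerplate is the default: an opcode added later is audited unless someone consciously sets the bit, which is the fail-safe behaviour Jens asks for, at the cost of a table lookup and a branch around each call on the hot path.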
On 6/3/2021 3:51 AM, Pavel Begunkov wrote: > On 6/2/21 8:46 PM, Paul Moore wrote: >> On Wed, Jun 2, 2021 at 4:27 AM Pavel Begunkov <asml.silence@gmail.com> wrote: >>> On 5/28/21 5:02 PM, Paul Moore wrote: >>>> On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote: >>>>> ... If we moved the _entry >>>>> and _exit calls into the individual operation case blocks (quick >>>>> openat example below) so that only certain operations were able to be >>>>> audited would that be acceptable assuming the high frequency ops were >>>>> untouched? My initial gut feeling was that this would involve >50% of >>>>> the ops, but Steve Grubb seems to think it would be less; it may be >>>>> time to look at that a bit more seriously, but if it gets a NACK >>>>> regardless it isn't worth the time - thoughts? >>>>> >>>>> case IORING_OP_OPENAT: >>>>> audit_uring_entry(req->opcode); >>>>> ret = io_openat(req, issue_flags); >>>>> audit_uring_exit(!ret, ret); >>>>> break; >>>> I wanted to pose this question again in case it was lost in the >>>> thread, I suspect this may be the last option before we have to "fix" >>>> things at the Kconfig level. I definitely don't want to have to go >>>> that route, and I suspect most everyone on this thread feels the same, >>>> so I'm hopeful we can find a solution that is begrudgingly acceptable >>>> to both groups. >>> May work for me, but have to ask how many, and what is the >>> criteria? I'd think anything opening a file or manipulating fs: >>> >>> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2], >>> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN, >>> IORING_OP_FILES_UPDATE >>> + coming mkdirat and others. >>> >>> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV? >>> >>> What about? >>> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE, >>> IORING_OP_FALLOCATE, IORING_OP_STATX, >>> IORING_OP_FADVISE, IORING_OP_MADVISE, >>> IORING_OP_EPOLL_CTL >> Looking quickly at v5.13-rc4 the following seems like candidates for >> auditing, there may be a small number of subtractions/additions to >> this list as people take a closer look, but it should serve as a >> starting point: >> >> IORING_OP_SENDMSG >> IORING_OP_RECVMSG >> IORING_OP_ACCEPT >> IORING_OP_CONNECT >> IORING_OP_FALLOCATE >> IORING_OP_OPENAT >> IORING_OP_CLOSE >> IORING_OP_MADVISE >> IORING_OP_OPENAT2 >> IORING_OP_SHUTDOWN >> IORING_OP_RENAMEAT >> IORING_OP_UNLINKAT >> >> ... can you live with that list? > it will bloat binary somewhat, but considering it's all in one > place -- io_issue_sqe(), it's workable. > > Not nice to have send/recv msg in the list, but I admit they > may do some crazy things. What can be traced for them? Both SELinux and Smack do access checks on packet operations. As access may be denied by these checks, audit needs to be available. This is true for UDS, IP and at least one other protocol family. > Because > at the moment of issue_sqe() not everything is read from the > userspace. > > see: io_sendmsg() { ...; io_sendmsg_copy_hdr(); }, > > will copy header only in io_sendmsg() in most cases, and > then stash it for re-issuing if needed. > > >>> Another question, io_uring may exercise asynchronous paths, >>> i.e. io_issue_sqe() returns before requests completes. >>> Shouldn't be the case for open/etc at the moment, but was that >>> considered? >> Yes, I noticed that when testing the code (and it makes sense when you >> look at how io_uring handles things). 
Depending on the state of the >> system when the io_uring request is submitted I've seen both sync and >> async io_uring operations with the associated different calling >> contexts. In the case where io_issue_sqe() needs to defer the >> operation to a different context you will see an audit record >> indicating that the operation failed and then another audit record >> when it completes; it's actually pretty interesting to be able to see >> how the system and io_uring are working. > Copying a reply to another message to keep clear out > of misunderstanding. > > "io_issue_sqe() may return 0 but leave the request inflight, > which will be completed asynchronously e.g. by IRQ, not going > through io_issue_sqe() or any io_read()/etc helpers again, and > after last audit_end() had already happened. > That's the case with read/write/timeout, but is not true for > open/etc." > > And there is interest in async send/recv[msg] as well (via > IRQ as described, callbacks, etc.). > >> We could always mask out these delayed attempts, but at this early >> stage they are helpful, and they may be useful for admins. >> >>> I don't see it happening, but would prefer to keep it open >>> async reimplementation in a distant future. Does audit sleep? >> The only place in the audit_uring_entry()/audit_uring_exit() code path >> that could sleep at present is the call to audit_log_uring() which is >> made when the rules dictate that an audit record be generated. The >> offending code is an allocation in audit_log_uring() which is >> currently GFP_KERNEL but really should be GFP_ATOMIC, or similar. It >> was a copy-n-paste from the similar syscall function where GFP_KERNEL >> is appropriate due to the calling context at the end of the syscall. >> I'll change that as soon as I'm done with this email. > Ok, depends where it steers, but there may be a requirement to > not sleep for some hooks because of not having a sleepable context. > >> Of course if you are calling io_uring_enter(2), or something similar, >> then audit may sleep as part of the normal syscall processing (as >> mentioned above), but that is due to the fact that io_uring_enter(2) >> is a syscall and not because of anything in io_issue_sqe(). >>
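For reference, the allocation-flag change Paul describes above is a one-line adjustment in audit_log_uring(); a sketch of what it would presumably look like, with the rest of the function unchanged from the posted patch:

static void audit_log_uring(struct audit_context *ctx)
{
	struct audit_buffer *ab;

	/* GFP_ATOMIC rather than GFP_KERNEL: unlike the syscall exit path,
	 * this can be reached from io_issue_sqe() contexts where sleeping
	 * is not acceptable */
	ab = audit_log_start(ctx, GFP_ATOMIC, AUDIT_URINGOP);
	if (!ab)
		return;
	/* ... record fields as in the posted patch ... */
	audit_log_end(ab);
}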
On Thu, Jun 3, 2021 at 11:54 AM Jens Axboe <axboe@kernel.dk> wrote: > On 5/28/21 10:02 AM, Paul Moore wrote: > > On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote: > >> ... If we moved the _entry > >> and _exit calls into the individual operation case blocks (quick > >> openat example below) so that only certain operations were able to be > >> audited would that be acceptable assuming the high frequency ops were > >> untouched? My initial gut feeling was that this would involve >50% of > >> the ops, but Steve Grubb seems to think it would be less; it may be > >> time to look at that a bit more seriously, but if it gets a NACK > >> regardless it isn't worth the time - thoughts? > >> > >> case IORING_OP_OPENAT: > >> audit_uring_entry(req->opcode); > >> ret = io_openat(req, issue_flags); > >> audit_uring_exit(!ret, ret); > >> break; > > > > I wanted to pose this question again in case it was lost in the > > thread, I suspect this may be the last option before we have to "fix" > > things at the Kconfig level. I definitely don't want to have to go > > that route, and I suspect most everyone on this thread feels the same, > > so I'm hopeful we can find a solution that is begrudgingly acceptable > > to both groups. > > Sorry for the lack of response here, but to sum up my order of > preference: > > 1) It's probably better to just make the audit an opt-out in io_op_defs > for each opcode, and avoid needing boiler plate code for each op > handler. The opt-out would ensure that new opcodes get it by default > it someone doesn't know what it is, and the io_op_defs addition would > mean that it's in generic code rather then in the handlers. Yes it's > a bit slower, but it's saner imho. > > 2) With the above, I'm fine with adding this to io_uring. I don't think > going the route of mutual exclusion in kconfig helps anyone, it'd > be counter productive to both sides. > > Hope that works and helps move this forward. I'll be mostly out of touch > the next week and a half, but wanted to ensure that I sent out my > (brief) thoughts before going away. Thanks Jens. I'll revise the patchset based on this (basically doing an opt-out version of what you did on May 26th) and do a v2 post with the other accumulated fixes/changes. If there is anything else that needs discussion/review I'm sure Pavel can help us out, he's been helpful thus far. -- paul moore www.paul-moore.com
On 2021-06-02 13:46, Paul Moore wrote: > On Wed, Jun 2, 2021 at 1:29 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2021-05-21 17:49, Paul Moore wrote: > > > WARNING - This is a work in progress and should not be merged > > > anywhere important. It is almost surely not complete, and while it > > > probably compiles it likely hasn't been booted and will do terrible > > > things. You have been warned. > > > > > > This patch adds basic auditing to io_uring operations, regardless of > > > their context. This is accomplished by allocating audit_context > > > structures for the io-wq worker and io_uring SQPOLL kernel threads > > > as well as explicitly auditing the io_uring operations in > > > io_issue_sqe(). The io_uring operations are audited using a new > > > AUDIT_URINGOP record, an example is shown below: > > > > > > % <TODO - insert AUDIT_URINGOP record example> > > > > > > Thanks to Richard Guy Briggs for review and feedback. > > > > > > Signed-off-by: Paul Moore <paul@paul-moore.com> > > > --- > > > fs/io-wq.c | 4 + > > > fs/io_uring.c | 11 +++ > > > include/linux/audit.h | 17 ++++ > > > include/uapi/linux/audit.h | 1 > > > kernel/audit.h | 2 + > > > kernel/auditsc.c | 173 ++++++++++++++++++++++++++++++++++++++++++++ > > > 6 files changed, 208 insertions(+) > > ... > > > > diff --git a/fs/io_uring.c b/fs/io_uring.c > > > index e481ac8a757a..e9941d1ad8fd 100644 > > > --- a/fs/io_uring.c > > > +++ b/fs/io_uring.c > > > @@ -78,6 +78,7 @@ > > > #include <linux/task_work.h> > > > #include <linux/pagemap.h> > > > #include <linux/io_uring.h> > > > +#include <linux/audit.h> > > > > > > #define CREATE_TRACE_POINTS > > > #include <trace/events/io_uring.h> > > > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) > > > if (req->work.creds && req->work.creds != current_cred()) > > > creds = override_creds(req->work.creds); > > > > > > + if (req->opcode < IORING_OP_LAST) > > > + audit_uring_entry(req->opcode); > > Note well the override_creds() call right above the audit code that is > being added, it will be important later in this email. > > > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > > > index cc89e9f9a753..729849d41631 100644 > > > --- a/kernel/auditsc.c > > > +++ b/kernel/auditsc.c > > > @@ -1536,6 +1562,52 @@ static void audit_log_proctitle(void) > > > audit_log_end(ab); > > > } > > > > > > +/** > > > + * audit_log_uring - generate a AUDIT_URINGOP record > > > + * @ctx: the audit context > > > + */ > > > +static void audit_log_uring(struct audit_context *ctx) > > > +{ > > > + struct audit_buffer *ab; > > > + const struct cred *cred; > > > + > > > + /* > > > + * TODO: What do we log here? I'm tossing in a few things to start the > > > + * conversation, but additional thought needs to go into this. > > > + */ > > > + > > > + ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP); > > > + if (!ab) > > > + return; > > > + cred = current_cred(); > > > > This may need to be req->work.creds. I haven't been following if the > > io_uring thread inherited the user task's creds (and below, comm and > > exe). > > Nope, we're good. See the existing code in io_issue_sqe() :) > > > > + audit_log_format(ab, "uring_op=%d", ctx->uring_op); > > > > arch is stored below in __audit_uring_entry() and never used in the > > AUDIT_CTX_URING case. That assignment can either be dropped or printed > > before uring_op similar to the SYSCALL record. 
> > Good point, I'll drop the code that records the arch from _entry(); it > is really only useful to give the appropriate context if needed for > other things in the audit stream, and that isn't the case like it is > with syscalls. > > > There aren't really any arg[0-3] to print. > > Which is why I didn't print them. > > > io_uring_register and io_uring_setup() args are better covered by other > > records. io_uring_enter() has 6 args and the last two aren't covered by > > SYSCALL anyways. > > ??? > > I think you are confusing the io_uring ops with syscalls; they are > very different things from an audit perspective and the io_uring > auditing is not intended to replace syscall records. The > io_uring_setup() and io_uring_enter() syscalls will be auditing just > as any other syscalls would be using the existing syscall audit code. > > > > + if (ctx->return_valid != AUDITSC_INVALID) > > > + audit_log_format(ab, " success=%s exit=%ld", > > > + (ctx->return_valid == AUDITSC_SUCCESS ? > > > + "yes" : "no"), > > > + ctx->return_code); > > > + audit_log_format(ab, > > > + " items=%d" > > > + " ppid=%d pid=%d auid=%u uid=%u gid=%u" > > > + " euid=%u suid=%u fsuid=%u" > > > + " egid=%u sgid=%u fsgid=%u", > > > + ctx->name_count, > > > + task_ppid_nr(current), > > > + task_tgid_nr(current), > > > + from_kuid(&init_user_ns, audit_get_loginuid(current)), > > > + from_kuid(&init_user_ns, cred->uid), > > > + from_kgid(&init_user_ns, cred->gid), > > > + from_kuid(&init_user_ns, cred->euid), > > > + from_kuid(&init_user_ns, cred->suid), > > > + from_kuid(&init_user_ns, cred->fsuid), > > > + from_kgid(&init_user_ns, cred->egid), > > > + from_kgid(&init_user_ns, cred->sgid), > > > + from_kgid(&init_user_ns, cred->fsgid)); > > > > The audit session ID is still important, relevant and qualifies auid. > > In keeping with the SYSCALL record format, I think we want to keep > > ses=audit_get_sessionid(current) in here. > > This might be another case of syscall/io_uring confusion. An io_uring > op doesn't necessarily have an audit session ID or an audit UID in the > conventional sense; for example think about SQPOLL works, shared > rings, etc. Right, but those syscalls are what instigate io_uring operations, so whatever process starts that operation, or gets handed that handle should be tracked with auid and sessionid (the two work together to track) unless we can easily track io_uring ops to connect them to a previous setup syscall. If we see a need to keep the auid, then the sessionid goes with it. > > I'm pretty sure we also want to keep comm= and exe= too, but may have to > > reach into req->task to get it. There are two values for comm possible, > > one from the original task and second "iou-sqp-<pid>" set at the top of > > io_sq_thread(). > > I think this is more syscall/io_uring confusion. I wouldn't call them confusion but rather parallels, attributing a particular subject to an action. > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
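If the ses=, comm= and exe= fields Richard is arguing for were kept, the additions to audit_log_uring() would presumably reuse the helpers the SYSCALL record path already relies on. A sketch of that suggestion (not part of the posted patch; whether comm should come from current or from the submitting task is exactly the open question above):

	char comm[sizeof(current->comm)];

	/* hypothetical additions, mirroring what audit_log_task_info()
	 * emits for the SYSCALL record */
	audit_log_format(ab, " ses=%u", audit_get_sessionid(current));
	audit_log_format(ab, " comm=");
	audit_log_untrustedstring(ab, get_task_comm(comm, current));
	audit_log_d_path_exe(ab, current->mm);	/* exe= */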
On Tue, Aug 24, 2021 at 9:21 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2021-06-02 13:46, Paul Moore wrote: > > On Wed, Jun 2, 2021 at 1:29 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2021-05-21 17:49, Paul Moore wrote: > > > > WARNING - This is a work in progress and should not be merged > > > > anywhere important. It is almost surely not complete, and while it > > > > probably compiles it likely hasn't been booted and will do terrible > > > > things. You have been warned. > > > > > > > > This patch adds basic auditing to io_uring operations, regardless of > > > > their context. This is accomplished by allocating audit_context > > > > structures for the io-wq worker and io_uring SQPOLL kernel threads > > > > as well as explicitly auditing the io_uring operations in > > > > io_issue_sqe(). The io_uring operations are audited using a new > > > > AUDIT_URINGOP record, an example is shown below: > > > > > > > > % <TODO - insert AUDIT_URINGOP record example> > > > > > > > > Thanks to Richard Guy Briggs for review and feedback. > > > > > > > > Signed-off-by: Paul Moore <paul@paul-moore.com> > > > > --- > > > > fs/io-wq.c | 4 + > > > > fs/io_uring.c | 11 +++ > > > > include/linux/audit.h | 17 ++++ > > > > include/uapi/linux/audit.h | 1 > > > > kernel/audit.h | 2 + > > > > kernel/auditsc.c | 173 ++++++++++++++++++++++++++++++++++++++++++++ > > > > 6 files changed, 208 insertions(+) ... > > > > + if (ctx->return_valid != AUDITSC_INVALID) > > > > + audit_log_format(ab, " success=%s exit=%ld", > > > > + (ctx->return_valid == AUDITSC_SUCCESS ? > > > > + "yes" : "no"), > > > > + ctx->return_code); > > > > + audit_log_format(ab, > > > > + " items=%d" > > > > + " ppid=%d pid=%d auid=%u uid=%u gid=%u" > > > > + " euid=%u suid=%u fsuid=%u" > > > > + " egid=%u sgid=%u fsgid=%u", > > > > + ctx->name_count, > > > > + task_ppid_nr(current), > > > > + task_tgid_nr(current), > > > > + from_kuid(&init_user_ns, audit_get_loginuid(current)), > > > > + from_kuid(&init_user_ns, cred->uid), > > > > + from_kgid(&init_user_ns, cred->gid), > > > > + from_kuid(&init_user_ns, cred->euid), > > > > + from_kuid(&init_user_ns, cred->suid), > > > > + from_kuid(&init_user_ns, cred->fsuid), > > > > + from_kgid(&init_user_ns, cred->egid), > > > > + from_kgid(&init_user_ns, cred->sgid), > > > > + from_kgid(&init_user_ns, cred->fsgid)); > > > > > > The audit session ID is still important, relevant and qualifies auid. > > > In keeping with the SYSCALL record format, I think we want to keep > > > ses=audit_get_sessionid(current) in here. > > > > This might be another case of syscall/io_uring confusion. An io_uring > > op doesn't necessarily have an audit session ID or an audit UID in the > > conventional sense; for example think about SQPOLL works, shared > > rings, etc. > > Right, but those syscalls are what instigate io_uring operations, so > whatever process starts that operation, or gets handed that handle > should be tracked with auid and sessionid (the two work together to > track) unless we can easily track io_uring ops to connect them to a > previous setup syscall. If we see a need to keep the auid, then the > sessionid goes with it. As a reminder, once the io_uring is created appropriately one can issue io_uring operations without making a syscall. Further, sharing a io_uring across process boundaries means that both the audit session ID and audit login UID used to create the io_uring might not be the same as the subject which issues operations to the io_uring. 
Any io_uring operations that happen synchronously as the result of a syscall should be associated with the SYSCALL record so the session ID and login UID will be part of the event. Asynchronous operations will not have that information because we don't have a way to get it.
diff --git a/fs/io-wq.c b/fs/io-wq.c
index 5361a9b4b47b..8af09a3336e0 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -16,6 +16,7 @@
 #include <linux/rculist_nulls.h>
 #include <linux/cpu.h>
 #include <linux/tracehook.h>
+#include <linux/audit.h>
 
 #include "io-wq.h"
 
@@ -535,6 +536,8 @@ static int io_wqe_worker(void *data)
 	snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid);
 	set_task_comm(current, buf);
 
+	audit_alloc_kernel(current);
+
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		long ret;
 
@@ -573,6 +576,7 @@ static int io_wqe_worker(void *data)
 		raw_spin_unlock_irq(&wqe->lock);
 	}
 
+	audit_free(current);
 	io_worker_exit(worker);
 	return 0;
 }
diff --git a/fs/io_uring.c b/fs/io_uring.c
index e481ac8a757a..e9941d1ad8fd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -78,6 +78,7 @@
 #include <linux/task_work.h>
 #include <linux/pagemap.h>
 #include <linux/io_uring.h>
+#include <linux/audit.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/io_uring.h>
@@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	if (req->work.creds && req->work.creds != current_cred())
 		creds = override_creds(req->work.creds);
 
+	if (req->opcode < IORING_OP_LAST)
+		audit_uring_entry(req->opcode);
+
 	switch (req->opcode) {
 	case IORING_OP_NOP:
 		ret = io_nop(req, issue_flags);
@@ -6211,6 +6215,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 		break;
 	}
 
+	if (req->opcode < IORING_OP_LAST)
+		audit_uring_exit(!ret, ret);
+
 	if (creds)
 		revert_creds(creds);
 
@@ -6827,6 +6834,8 @@ static int io_sq_thread(void *data)
 		set_cpus_allowed_ptr(current, cpu_online_mask);
 	current->flags |= PF_NO_SETAFFINITY;
 
+	audit_alloc_kernel(current);
+
 	mutex_lock(&sqd->lock);
 	/* a user may had exited before the thread started */
 	io_run_task_work_head(&sqd->park_task_work);
@@ -6916,6 +6925,8 @@ static int io_sq_thread(void *data)
 	io_run_task_work_head(&sqd->park_task_work);
 	mutex_unlock(&sqd->lock);
 
+	audit_free(current);
+
 	complete(&sqd->exited);
 	do_exit(0);
 }
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 82b7c1116a85..6a0c013bc7de 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -286,7 +286,10 @@ static inline int audit_signal_info(int sig, struct task_struct *t)
 /* These are defined in auditsc.c */
 				/* Public API */
 extern int audit_alloc(struct task_struct *task);
+extern int audit_alloc_kernel(struct task_struct *task);
 extern void __audit_free(struct task_struct *task);
+extern void __audit_uring_entry(u8 op);
+extern void __audit_uring_exit(int success, long code);
 extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1,
 				  unsigned long a2, unsigned long a3);
 extern void __audit_syscall_exit(int ret_success, long ret_value);
@@ -323,6 +326,16 @@ static inline void audit_free(struct task_struct *task)
 	if (unlikely(task->audit_context))
 		__audit_free(task);
 }
+static inline void audit_uring_entry(u8 op)
+{
+	if (unlikely(audit_context()))
+		__audit_uring_entry(op);
+}
+static inline void audit_uring_exit(int success, long code)
+{
+	if (unlikely(audit_context()))
+		__audit_uring_exit(success, code);
+}
 static inline void audit_syscall_entry(int major, unsigned long a0,
 				       unsigned long a1, unsigned long a2,
 				       unsigned long a3)
@@ -554,6 +567,10 @@ static inline int audit_alloc(struct task_struct *task)
 {
 	return 0;
 }
+static inline int audit_alloc_kernel(struct task_struct *task)
+{
+	return 0;
+}
 static inline void audit_free(struct task_struct *task)
 { }
 static inline void audit_syscall_entry(int major, unsigned long a0,
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index cd2d8279a5e4..b26e0c435e8b 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -118,6 +118,7 @@
 #define AUDIT_TIME_ADJNTPVAL	1333	/* NTP value adjustment */
 #define AUDIT_BPF		1334	/* BPF subsystem */
 #define AUDIT_EVENT_LISTENER	1335	/* Task joined multicast read socket */
+#define AUDIT_URINGOP		1336	/* io_uring operation */
 
 #define AUDIT_AVC		1400	/* SE Linux avc denial or grant */
 #define AUDIT_SELINUX_ERR	1401	/* Internal SE Linux Errors */
diff --git a/kernel/audit.h b/kernel/audit.h
index fba180de5912..50de827497ca 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -100,10 +100,12 @@ struct audit_context {
 	enum {
 		AUDIT_CTX_UNUSED,	/* audit_context is currently unused */
 		AUDIT_CTX_SYSCALL,	/* in use by syscall */
+		AUDIT_CTX_URING,	/* in use by io_uring */
 	} context;
 	enum audit_state    state, current_state;
 	unsigned int	    serial;     /* serial number for record */
 	int		    major;      /* syscall number */
+	int		    uring_op;   /* uring operation */
 	struct timespec64   ctime;      /* time of syscall entry */
 	unsigned long	    argv[4];    /* syscall arguments */
 	long		    return_code;/* syscall return code */
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index cc89e9f9a753..729849d41631 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -953,6 +953,7 @@ static void audit_reset_context(struct audit_context *ctx)
 	ctx->current_state = ctx->state;
 	ctx->serial = 0;
 	ctx->major = 0;
+	ctx->uring_op = 0;
 	ctx->ctime = (struct timespec64){ .tv_sec = 0, .tv_nsec = 0 };
 	memset(ctx->argv, 0, sizeof(ctx->argv));
 	ctx->return_code = 0;
@@ -1038,6 +1039,31 @@ int audit_alloc(struct task_struct *tsk)
 	return 0;
 }
 
+/**
+ * audit_alloc_kernel - allocate an audit_context for a kernel task
+ * @tsk: the kernel task
+ *
+ * Similar to the audit_alloc() function, but intended for kernel private
+ * threads.  Returns zero on success, negative values on failure.
+ */
+int audit_alloc_kernel(struct task_struct *tsk)
+{
+	/*
+	 * At the moment we are just going to call into audit_alloc() to
+	 * simplify the code, but there two things to keep in mind with this
+	 * approach:
+	 *
+	 * 1. Filtering internal kernel tasks is a bit laughable in almost all
+	 *    cases, but there is at least one case where there is a benefit:
+	 *    the '-a task,never' case allows the admin to effectively disable
+	 *    task auditing at runtime.
+	 *
+	 * 2. The {set,clear}_task_syscall_work() ops likely have zero effect
+	 *    on these internal kernel tasks, but they probably don't hurt either.
+	 */
+	return audit_alloc(tsk);
+}
+
 static inline void audit_free_context(struct audit_context *context)
 {
 	/* resetting is extra work, but it is likely just noise */
@@ -1536,6 +1562,52 @@ static void audit_log_proctitle(void)
 	audit_log_end(ab);
 }
 
+/**
+ * audit_log_uring - generate a AUDIT_URINGOP record
+ * @ctx: the audit context
+ */
+static void audit_log_uring(struct audit_context *ctx)
+{
+	struct audit_buffer *ab;
+	const struct cred *cred;
+
+	/*
+	 * TODO: What do we log here?  I'm tossing in a few things to start the
+	 *       conversation, but additional thought needs to go into this.
+	 */
+
+	ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP);
+	if (!ab)
+		return;
+	cred = current_cred();
+	audit_log_format(ab, "uring_op=%d", ctx->uring_op);
+	if (ctx->return_valid != AUDITSC_INVALID)
+		audit_log_format(ab, " success=%s exit=%ld",
+				 (ctx->return_valid == AUDITSC_SUCCESS ?
+				  "yes" : "no"),
+				 ctx->return_code);
+	audit_log_format(ab,
+			 " items=%d"
+			 " ppid=%d pid=%d auid=%u uid=%u gid=%u"
+			 " euid=%u suid=%u fsuid=%u"
+			 " egid=%u sgid=%u fsgid=%u",
+			 ctx->name_count,
+			 task_ppid_nr(current),
+			 task_tgid_nr(current),
+			 from_kuid(&init_user_ns, audit_get_loginuid(current)),
+			 from_kuid(&init_user_ns, cred->uid),
+			 from_kgid(&init_user_ns, cred->gid),
+			 from_kuid(&init_user_ns, cred->euid),
+			 from_kuid(&init_user_ns, cred->suid),
+			 from_kuid(&init_user_ns, cred->fsuid),
+			 from_kgid(&init_user_ns, cred->egid),
+			 from_kgid(&init_user_ns, cred->sgid),
+			 from_kgid(&init_user_ns, cred->fsgid));
+	audit_log_task_context(ab);
+	audit_log_key(ab, ctx->filterkey);
+	audit_log_end(ab);
+}
+
 static void audit_log_exit(void)
 {
 	int i, call_panic = 0;
@@ -1571,6 +1643,9 @@ static void audit_log_exit(void)
 		audit_log_key(ab, context->filterkey);
 		audit_log_end(ab);
 		break;
+	case AUDIT_CTX_URING:
+		audit_log_uring(context);
+		break;
 	default:
 		BUG();
 		break;
@@ -1740,6 +1815,104 @@ static void audit_return_fixup(struct audit_context *ctx,
 	ctx->return_valid = (success ? AUDITSC_SUCCESS : AUDITSC_FAILURE);
 }
 
+/**
+ * __audit_uring_entry - prepare the kernel task's audit context for io_uring
+ * @op: the io_uring opcode
+ *
+ * This is similar to audit_syscall_entry() but is intended for use by io_uring
+ * operations.
+ */
+void __audit_uring_entry(u8 op)
+{
+	struct audit_context *ctx = audit_context();
+
+	if (!audit_enabled || !ctx || ctx->state == AUDIT_DISABLED)
+		return;
+
+	/*
+	 * NOTE: It's possible that we can be called from the process' context
+	 *       before it returns to userspace, and before audit_syscall_exit()
+	 *       is called.  In this case there is not much to do, just record
+	 *       the io_uring details and return.
+	 */
+	ctx->uring_op = op;
+	if (ctx->context == AUDIT_CTX_SYSCALL)
+		return;
+
+	ctx->dummy = !audit_n_rules;
+	if (!ctx->dummy && ctx->state == AUDIT_BUILD_CONTEXT)
+		ctx->prio = 0;
+
+	ctx->arch = syscall_get_arch(current);
+	ctx->context = AUDIT_CTX_URING;
+	ctx->current_state = ctx->state;
+	ktime_get_coarse_real_ts64(&ctx->ctime);
+}
+
+/**
+ * __audit_uring_exit - wrap up the kernel task's audit context after io_uring
+ * @success: true/false value to indicate if the operation succeeded or not
+ * @code: operation return code
+ *
+ * This is similar to audit_syscall_exit() but is intended for use by io_uring
+ * operations.
+ */
+void __audit_uring_exit(int success, long code)
+{
+	struct audit_context *ctx = audit_context();
+
+	/*
+	 * TODO: At some point we will likely want to filter on io_uring ops
+	 *       and other things similar to what we do for syscalls, but that
+	 *       is something for another day; just record what we can here.
+	 */
+
+	if (!ctx || ctx->dummy)
+		goto out;
+	if (ctx->context == AUDIT_CTX_SYSCALL) {
+		/*
+		 * NOTE: See the note in __audit_uring_entry() about the case
+		 *       where we may be called from process context before we
+		 *       return to userspace via audit_syscall_exit().  In this
+		 *       case we simply emit a URINGOP record and bail, the
+		 *       normal syscall exit handling will take care of
+		 *       everything else.
+		 *       It is also worth mentioning that when we are called,
+		 *       the current process creds may differ from the creds
+		 *       used during the normal syscall processing; keep that
+		 *       in mind if/when we move the record generation code.
+		 */
+
+		/*
+		 * We need to filter on the syscall info here to decide if we
+		 * should emit a URINGOP record.  I know it seems odd but this
+		 * solves the problem where users have a filter to block *all*
+		 * syscall records in the "exit" filter; we want to preserve
+		 * the behavior here.
+		 */
+		audit_filter_syscall(current, ctx);
+		audit_filter_inodes(current, ctx);
+		if (ctx->current_state != AUDIT_RECORD_CONTEXT)
+			return;
+
+		audit_log_uring(ctx);
+		return;
+	}
+
+	/* this may generate CONFIG_CHANGE records */
+	if (!list_empty(&ctx->killed_trees))
+		audit_kill_trees(ctx);
+
+	audit_filter_inodes(current, ctx);
+	if (ctx->current_state != AUDIT_RECORD_CONTEXT)
+		goto out;
+	audit_return_fixup(ctx, success, code);
+	audit_log_exit();
+
+out:
+	audit_reset_context(ctx);
+}
+
 /**
  * __audit_syscall_entry - fill in an audit record at syscall entry
  * @major: major syscall type (function)
WARNING - This is a work in progress and should not be merged anywhere important. It is almost surely not complete, and while it probably compiles it likely hasn't been booted and will do terrible things. You have been warned.

This patch adds basic auditing to io_uring operations, regardless of their context. This is accomplished by allocating audit_context structures for the io-wq worker and io_uring SQPOLL kernel threads as well as explicitly auditing the io_uring operations in io_issue_sqe(). The io_uring operations are audited using a new AUDIT_URINGOP record, an example is shown below:

% <TODO - insert AUDIT_URINGOP record example>

Thanks to Richard Guy Briggs for review and feedback.

Signed-off-by: Paul Moore <paul@paul-moore.com>
---
 fs/io-wq.c                 |    4 +
 fs/io_uring.c              |   11 +++
 include/linux/audit.h      |   17 ++++
 include/uapi/linux/audit.h |    1 
 kernel/audit.h             |    2 +
 kernel/auditsc.c           |  173 ++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 208 insertions(+)
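For anyone who wants to see the new record in practice once the series is applied, a minimal userspace sketch using liburing is below; it submits a single IORING_OP_OPENAT, one of the operations expected to be audited, so the new audit_uring_entry()/audit_uring_exit() path should fire. The file path and queue depth are arbitrary, and this test program is not part of the patch:

/* uringop-audit-test.c -- build with: gcc uringop-audit-test.c -o test -luring */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int ret;

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %d\n", ret);
		return 1;
	}

	/* queue an openat op; it is dispatched via io_issue_sqe() */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_openat(sqe, AT_FDCWD, "/etc/hostname", O_RDONLY, 0);
	io_uring_submit(&ring);

	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret == 0) {
		printf("openat via io_uring returned %d\n", cqe->res);
		if (cqe->res >= 0)
			close(cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}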