diff mbox series

[ghak90,V8,07/16] audit: add contid support for signalling the audit daemon

Message ID 7d7933d742fdf4a94c84b791906a450b16f2e81f.1577736799.git.rgb@redhat.com (mailing list archive)
State New, archived
Headers show
Series audit: implement container identifier | expand

Commit Message

Richard Guy Briggs Dec. 31, 2019, 7:48 p.m. UTC
Add audit container identifier support to the action of signalling the
audit daemon.

Since this would need to add an element to the audit_sig_info struct,
a new record type AUDIT_SIGNAL_INFO2 was created with a new
audit_sig_info2 struct.  Corresponding support is required in the
userspace code to reflect the new record request and reply type.
An older userspace won't break since it won't know to request this
record type.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
---
 include/linux/audit.h       |  7 +++++++
 include/uapi/linux/audit.h  |  1 +
 kernel/audit.c              | 35 +++++++++++++++++++++++++++++++++++
 kernel/audit.h              |  1 +
 security/selinux/nlmsgtab.c |  1 +
 5 files changed, 45 insertions(+)

Comments

Paul Moore Jan. 22, 2020, 9:28 p.m. UTC | #1
On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>
> Add audit container identifier support to the action of signalling the
> audit daemon.
>
> Since this would need to add an element to the audit_sig_info struct,
> a new record type AUDIT_SIGNAL_INFO2 was created with a new
> audit_sig_info2 struct.  Corresponding support is required in the
> userspace code to reflect the new record request and reply type.
> An older userspace won't break since it won't know to request this
> record type.
>
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> ---
>  include/linux/audit.h       |  7 +++++++
>  include/uapi/linux/audit.h  |  1 +
>  kernel/audit.c              | 35 +++++++++++++++++++++++++++++++++++
>  kernel/audit.h              |  1 +
>  security/selinux/nlmsgtab.c |  1 +
>  5 files changed, 45 insertions(+)

...

> diff --git a/kernel/audit.c b/kernel/audit.c
> index 0871c3e5d6df..51159c94041c 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -126,6 +126,14 @@ struct auditd_connection {
>  kuid_t         audit_sig_uid = INVALID_UID;
>  pid_t          audit_sig_pid = -1;
>  u32            audit_sig_sid = 0;
> +/* Since the signal information is stored in the record buffer at the
> + * time of the signal, but not retrieved until later, there is a chance
> + * that the last process in the container could terminate before the
> + * signal record is delivered.  In this circumstance, there is a chance
> + * the orchestrator could reuse the audit container identifier, causing
> + * an overlap of audit records that refer to the same audit container
> + * identifier, but a different container instance.  */
> +u64            audit_sig_cid = AUDIT_CID_UNSET;

I believe we could prevent the case mentioned above by taking an
additional reference to the audit container ID object when the signal
information is collected, dropping it only after the signal
information is collected by userspace or another process signals the
audit daemon.  Yes, it would block that audit container ID from being
reused immediately, but since we are talking about one number out of
2^64 that seems like a reasonable tradeoff.

--
paul moore
www.paul-moore.com
Richard Guy Briggs Jan. 23, 2020, 4:29 p.m. UTC | #2
On 2020-01-22 16:28, Paul Moore wrote:
> On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >
> > Add audit container identifier support to the action of signalling the
> > audit daemon.
> >
> > Since this would need to add an element to the audit_sig_info struct,
> > a new record type AUDIT_SIGNAL_INFO2 was created with a new
> > audit_sig_info2 struct.  Corresponding support is required in the
> > userspace code to reflect the new record request and reply type.
> > An older userspace won't break since it won't know to request this
> > record type.
> >
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > ---
> >  include/linux/audit.h       |  7 +++++++
> >  include/uapi/linux/audit.h  |  1 +
> >  kernel/audit.c              | 35 +++++++++++++++++++++++++++++++++++
> >  kernel/audit.h              |  1 +
> >  security/selinux/nlmsgtab.c |  1 +
> >  5 files changed, 45 insertions(+)
> 
> ...
> 
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index 0871c3e5d6df..51159c94041c 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -126,6 +126,14 @@ struct auditd_connection {
> >  kuid_t         audit_sig_uid = INVALID_UID;
> >  pid_t          audit_sig_pid = -1;
> >  u32            audit_sig_sid = 0;
> > +/* Since the signal information is stored in the record buffer at the
> > + * time of the signal, but not retrieved until later, there is a chance
> > + * that the last process in the container could terminate before the
> > + * signal record is delivered.  In this circumstance, there is a chance
> > + * the orchestrator could reuse the audit container identifier, causing
> > + * an overlap of audit records that refer to the same audit container
> > + * identifier, but a different container instance.  */
> > +u64            audit_sig_cid = AUDIT_CID_UNSET;
> 
> I believe we could prevent the case mentioned above by taking an
> additional reference to the audit container ID object when the signal
> information is collected, dropping it only after the signal
> information is collected by userspace or another process signals the
> audit daemon.  Yes, it would block that audit container ID from being
> reused immediately, but since we are talking about one number out of
> 2^64 that seems like a reasonable tradeoff.

I had thought that through and should have been more explicit about that
situation when I documented it.  We could do that, but then the syscall
records would be connected with the call from auditd on shutdown to
request that signal information, rather than the exit of that last
process that was using that container.  This strikes me as misleading.
Is that really what we want?

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore Jan. 23, 2020, 5:09 p.m. UTC | #3
On Thu, Jan 23, 2020 at 11:29 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-01-22 16:28, Paul Moore wrote:
> > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > >
> > > Add audit container identifier support to the action of signalling the
> > > audit daemon.
> > >
> > > Since this would need to add an element to the audit_sig_info struct,
> > > a new record type AUDIT_SIGNAL_INFO2 was created with a new
> > > audit_sig_info2 struct.  Corresponding support is required in the
> > > userspace code to reflect the new record request and reply type.
> > > An older userspace won't break since it won't know to request this
> > > record type.
> > >
> > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > ---
> > >  include/linux/audit.h       |  7 +++++++
> > >  include/uapi/linux/audit.h  |  1 +
> > >  kernel/audit.c              | 35 +++++++++++++++++++++++++++++++++++
> > >  kernel/audit.h              |  1 +
> > >  security/selinux/nlmsgtab.c |  1 +
> > >  5 files changed, 45 insertions(+)
> >
> > ...
> >
> > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > index 0871c3e5d6df..51159c94041c 100644
> > > --- a/kernel/audit.c
> > > +++ b/kernel/audit.c
> > > @@ -126,6 +126,14 @@ struct auditd_connection {
> > >  kuid_t         audit_sig_uid = INVALID_UID;
> > >  pid_t          audit_sig_pid = -1;
> > >  u32            audit_sig_sid = 0;
> > > +/* Since the signal information is stored in the record buffer at the
> > > + * time of the signal, but not retrieved until later, there is a chance
> > > + * that the last process in the container could terminate before the
> > > + * signal record is delivered.  In this circumstance, there is a chance
> > > + * the orchestrator could reuse the audit container identifier, causing
> > > + * an overlap of audit records that refer to the same audit container
> > > + * identifier, but a different container instance.  */
> > > +u64            audit_sig_cid = AUDIT_CID_UNSET;
> >
> > I believe we could prevent the case mentioned above by taking an
> > additional reference to the audit container ID object when the signal
> > information is collected, dropping it only after the signal
> > information is collected by userspace or another process signals the
> > audit daemon.  Yes, it would block that audit container ID from being
> > reused immediately, but since we are talking about one number out of
> > 2^64 that seems like a reasonable tradeoff.
>
> I had thought that through and should have been more explicit about that
> situation when I documented it.  We could do that, but then the syscall
> records would be connected with the call from auditd on shutdown to
> request that signal information, rather than the exit of that last
> process that was using that container.  This strikes me as misleading.
> Is that really what we want?

 ???

I think one of us is not understanding the other; maybe it's me, maybe
it's you, maybe it's both of us.

Anyway, here is what I was trying to convey with my original comment
... When we record the audit container ID in audit_signal_info() we
take an extra reference to the audit container ID object so that it
will not disappear (and get reused) until after we respond with an
AUDIT_SIGNAL_INFO2.  In audit_receive_msg() when we do the
AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took in
audit_signal_info().  Unless I'm missing some other change you made,
this *shouldn't* affect the syscall records, all it does is preserve
the audit container ID object in the kernel's ACID store so it doesn't
get reused.

(We do need to do some extra housekeeping in audit_signal_info() to
deal with the case where nobody asks for AUDIT_SIGNAL_INFO2 -
basically if audit_sig_cid is not NULL we should drop a reference
before assigning it a new object pointer, and of course we would need
to set audit_sig_cid to NULL in audit_receive_msg() after sending it
up to userspace and dropping the extra ref.)
Richard Guy Briggs Jan. 23, 2020, 8:04 p.m. UTC | #4
On 2020-01-23 12:09, Paul Moore wrote:
> On Thu, Jan 23, 2020 at 11:29 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-01-22 16:28, Paul Moore wrote:
> > > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > >
> > > > Add audit container identifier support to the action of signalling the
> > > > audit daemon.
> > > >
> > > > Since this would need to add an element to the audit_sig_info struct,
> > > > a new record type AUDIT_SIGNAL_INFO2 was created with a new
> > > > audit_sig_info2 struct.  Corresponding support is required in the
> > > > userspace code to reflect the new record request and reply type.
> > > > An older userspace won't break since it won't know to request this
> > > > record type.
> > > >
> > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > > ---
> > > >  include/linux/audit.h       |  7 +++++++
> > > >  include/uapi/linux/audit.h  |  1 +
> > > >  kernel/audit.c              | 35 +++++++++++++++++++++++++++++++++++
> > > >  kernel/audit.h              |  1 +
> > > >  security/selinux/nlmsgtab.c |  1 +
> > > >  5 files changed, 45 insertions(+)
> > >
> > > ...
> > >
> > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > index 0871c3e5d6df..51159c94041c 100644
> > > > --- a/kernel/audit.c
> > > > +++ b/kernel/audit.c
> > > > @@ -126,6 +126,14 @@ struct auditd_connection {
> > > >  kuid_t         audit_sig_uid = INVALID_UID;
> > > >  pid_t          audit_sig_pid = -1;
> > > >  u32            audit_sig_sid = 0;
> > > > +/* Since the signal information is stored in the record buffer at the
> > > > + * time of the signal, but not retrieved until later, there is a chance
> > > > + * that the last process in the container could terminate before the
> > > > + * signal record is delivered.  In this circumstance, there is a chance
> > > > + * the orchestrator could reuse the audit container identifier, causing
> > > > + * an overlap of audit records that refer to the same audit container
> > > > + * identifier, but a different container instance.  */
> > > > +u64            audit_sig_cid = AUDIT_CID_UNSET;
> > >
> > > I believe we could prevent the case mentioned above by taking an
> > > additional reference to the audit container ID object when the signal
> > > information is collected, dropping it only after the signal
> > > information is collected by userspace or another process signals the
> > > audit daemon.  Yes, it would block that audit container ID from being
> > > reused immediately, but since we are talking about one number out of
> > > 2^64 that seems like a reasonable tradeoff.
> >
> > I had thought that through and should have been more explicit about that
> > situation when I documented it.  We could do that, but then the syscall
> > records would be connected with the call from auditd on shutdown to
> > request that signal information, rather than the exit of that last
> > process that was using that container.  This strikes me as misleading.
> > Is that really what we want?
> 
>  ???
> 
> I think one of us is not understanding the other; maybe it's me, maybe
> it's you, maybe it's both of us.
> 
> Anyway, here is what I was trying to convey with my original comment
> ... When we record the audit container ID in audit_signal_info() we
> take an extra reference to the audit container ID object so that it
> will not disappear (and get reused) until after we respond with an
> AUDIT_SIGNAL_INFO2.  In audit_receive_msg() when we do the
> AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took in
> audit_signal_info().  Unless I'm missing some other change you made,
> this *shouldn't* affect the syscall records, all it does is preserve
> the audit container ID object in the kernel's ACID store so it doesn't
> get reused.

This is exactly what I had understood.  I hadn't considered the extra
details below in detail due to my original syscall concern, but they
make sense.

The syscall I refer to is the one connected with the drop of the
audit container identifier by the last process that was in that
container in patch 5/16.  The production of this record is contingent on
the last ref in a contobj being dropped.  So if it is due to that ref
being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
record it fetched, then it will appear that the fetch action closed the
container rather than the last process in the container to exit.

Does this make sense?

> (We do need to do some extra housekeeping in audit_signal_info() to
> deal with the case where nobody asks for AUDIT_SIGNAL_INFO2 -
> basically if audit_sig_cid is not NULL we should drop a reference
> before assigning it a new object pointer, and of course we would need
> to set audit_sig_cid to NULL in audit_receive_msg() after sending it
> up to userspace and dropping the extra ref.)
> 
> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore Jan. 23, 2020, 9:35 p.m. UTC | #5
On Thu, Jan 23, 2020 at 3:04 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-01-23 12:09, Paul Moore wrote:
> > On Thu, Jan 23, 2020 at 11:29 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2020-01-22 16:28, Paul Moore wrote:
> > > > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > >
> > > > > Add audit container identifier support to the action of signalling the
> > > > > audit daemon.
> > > > >
> > > > > Since this would need to add an element to the audit_sig_info struct,
> > > > > a new record type AUDIT_SIGNAL_INFO2 was created with a new
> > > > > audit_sig_info2 struct.  Corresponding support is required in the
> > > > > userspace code to reflect the new record request and reply type.
> > > > > An older userspace won't break since it won't know to request this
> > > > > record type.
> > > > >
> > > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > > > ---
> > > > >  include/linux/audit.h       |  7 +++++++
> > > > >  include/uapi/linux/audit.h  |  1 +
> > > > >  kernel/audit.c              | 35 +++++++++++++++++++++++++++++++++++
> > > > >  kernel/audit.h              |  1 +
> > > > >  security/selinux/nlmsgtab.c |  1 +
> > > > >  5 files changed, 45 insertions(+)
> > > >
> > > > ...
> > > >
> > > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > > index 0871c3e5d6df..51159c94041c 100644
> > > > > --- a/kernel/audit.c
> > > > > +++ b/kernel/audit.c
> > > > > @@ -126,6 +126,14 @@ struct auditd_connection {
> > > > >  kuid_t         audit_sig_uid = INVALID_UID;
> > > > >  pid_t          audit_sig_pid = -1;
> > > > >  u32            audit_sig_sid = 0;
> > > > > +/* Since the signal information is stored in the record buffer at the
> > > > > + * time of the signal, but not retrieved until later, there is a chance
> > > > > + * that the last process in the container could terminate before the
> > > > > + * signal record is delivered.  In this circumstance, there is a chance
> > > > > + * the orchestrator could reuse the audit container identifier, causing
> > > > > + * an overlap of audit records that refer to the same audit container
> > > > > + * identifier, but a different container instance.  */
> > > > > +u64            audit_sig_cid = AUDIT_CID_UNSET;
> > > >
> > > > I believe we could prevent the case mentioned above by taking an
> > > > additional reference to the audit container ID object when the signal
> > > > information is collected, dropping it only after the signal
> > > > information is collected by userspace or another process signals the
> > > > audit daemon.  Yes, it would block that audit container ID from being
> > > > reused immediately, but since we are talking about one number out of
> > > > 2^64 that seems like a reasonable tradeoff.
> > >
> > > I had thought that through and should have been more explicit about that
> > > situation when I documented it.  We could do that, but then the syscall
> > > records would be connected with the call from auditd on shutdown to
> > > request that signal information, rather than the exit of that last
> > > process that was using that container.  This strikes me as misleading.
> > > Is that really what we want?
> >
> >  ???
> >
> > I think one of us is not understanding the other; maybe it's me, maybe
> > it's you, maybe it's both of us.
> >
> > Anyway, here is what I was trying to convey with my original comment
> > ... When we record the audit container ID in audit_signal_info() we
> > take an extra reference to the audit container ID object so that it
> > will not disappear (and get reused) until after we respond with an
> > AUDIT_SIGNAL_INFO2.  In audit_receive_msg() when we do the
> > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took in
> > audit_signal_info().  Unless I'm missing some other change you made,
> > this *shouldn't* affect the syscall records, all it does is preserve
> > the audit container ID object in the kernel's ACID store so it doesn't
> > get reused.
>
> This is exactly what I had understood.  I hadn't considered the extra
> details below in detail due to my original syscall concern, but they
> make sense.
>
> The syscall I refer to is the one connected with the drop of the
> audit container identifier by the last process that was in that
> container in patch 5/16.  The production of this record is contingent on
> the last ref in a contobj being dropped.  So if it is due to that ref
> being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> record it fetched, then it will appear that the fetch action closed the
> container rather than the last process in the container to exit.
>
> Does this make sense?

More so than your original reply, at least to me anyway.

It makes sense that the audit container ID wouldn't be marked as
"dead" since it would still be very much alive and available for use
by the orchestrator, the question is if that is desirable or not.  I
think the answer to this comes down the preserving the correctness of
the audit log.

If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
reused then I think there is a legitimate concern that the audit log
is not correct, and could be misleading.  If we solve that by grabbing
an extra reference, then there could also be some confusion as
userspace considers a container to be "dead" while the audit container
ID still exists in the kernel, and the kernel generated audit
container ID death record will not be generated until much later (and
possibly be associated with a different event, but that could be
solved by unassociating the container death record).  Of the two
approaches, I think the latter is safer in that it preserves the
correctness of the audit log, even though it could result in a delay
of the container death record.

Neither way is perfect, so if you have any other ideas I'm all ears.

> > (We do need to do some extra housekeeping in audit_signal_info() to
> > deal with the case where nobody asks for AUDIT_SIGNAL_INFO2 -
> > basically if audit_sig_cid is not NULL we should drop a reference
> > before assigning it a new object pointer, and of course we would need
> > to set audit_sig_cid to NULL in audit_receive_msg() after sending it
> > up to userspace and dropping the extra ref.)
Richard Guy Briggs Feb. 4, 2020, 11:14 p.m. UTC | #6
On 2020-01-23 16:35, Paul Moore wrote:
> On Thu, Jan 23, 2020 at 3:04 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-01-23 12:09, Paul Moore wrote:
> > > On Thu, Jan 23, 2020 at 11:29 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2020-01-22 16:28, Paul Moore wrote:
> > > > > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > >
> > > > > > Add audit container identifier support to the action of signalling the
> > > > > > audit daemon.
> > > > > >
> > > > > > Since this would need to add an element to the audit_sig_info struct,
> > > > > > a new record type AUDIT_SIGNAL_INFO2 was created with a new
> > > > > > audit_sig_info2 struct.  Corresponding support is required in the
> > > > > > userspace code to reflect the new record request and reply type.
> > > > > > An older userspace won't break since it won't know to request this
> > > > > > record type.
> > > > > >
> > > > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > > > > ---
> > > > > >  include/linux/audit.h       |  7 +++++++
> > > > > >  include/uapi/linux/audit.h  |  1 +
> > > > > >  kernel/audit.c              | 35 +++++++++++++++++++++++++++++++++++
> > > > > >  kernel/audit.h              |  1 +
> > > > > >  security/selinux/nlmsgtab.c |  1 +
> > > > > >  5 files changed, 45 insertions(+)
> > > > >
> > > > > ...
> > > > >
> > > > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > > > index 0871c3e5d6df..51159c94041c 100644
> > > > > > --- a/kernel/audit.c
> > > > > > +++ b/kernel/audit.c
> > > > > > @@ -126,6 +126,14 @@ struct auditd_connection {
> > > > > >  kuid_t         audit_sig_uid = INVALID_UID;
> > > > > >  pid_t          audit_sig_pid = -1;
> > > > > >  u32            audit_sig_sid = 0;
> > > > > > +/* Since the signal information is stored in the record buffer at the
> > > > > > + * time of the signal, but not retrieved until later, there is a chance
> > > > > > + * that the last process in the container could terminate before the
> > > > > > + * signal record is delivered.  In this circumstance, there is a chance
> > > > > > + * the orchestrator could reuse the audit container identifier, causing
> > > > > > + * an overlap of audit records that refer to the same audit container
> > > > > > + * identifier, but a different container instance.  */
> > > > > > +u64            audit_sig_cid = AUDIT_CID_UNSET;
> > > > >
> > > > > I believe we could prevent the case mentioned above by taking an
> > > > > additional reference to the audit container ID object when the signal
> > > > > information is collected, dropping it only after the signal
> > > > > information is collected by userspace or another process signals the
> > > > > audit daemon.  Yes, it would block that audit container ID from being
> > > > > reused immediately, but since we are talking about one number out of
> > > > > 2^64 that seems like a reasonable tradeoff.
> > > >
> > > > I had thought that through and should have been more explicit about that
> > > > situation when I documented it.  We could do that, but then the syscall
> > > > records would be connected with the call from auditd on shutdown to
> > > > request that signal information, rather than the exit of that last
> > > > process that was using that container.  This strikes me as misleading.
> > > > Is that really what we want?
> > >
> > >  ???
> > >
> > > I think one of us is not understanding the other; maybe it's me, maybe
> > > it's you, maybe it's both of us.
> > >
> > > Anyway, here is what I was trying to convey with my original comment
> > > ... When we record the audit container ID in audit_signal_info() we
> > > take an extra reference to the audit container ID object so that it
> > > will not disappear (and get reused) until after we respond with an
> > > AUDIT_SIGNAL_INFO2.  In audit_receive_msg() when we do the
> > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took in
> > > audit_signal_info().  Unless I'm missing some other change you made,
> > > this *shouldn't* affect the syscall records, all it does is preserve
> > > the audit container ID object in the kernel's ACID store so it doesn't
> > > get reused.
> >
> > This is exactly what I had understood.  I hadn't considered the extra
> > details below in detail due to my original syscall concern, but they
> > make sense.
> >
> > The syscall I refer to is the one connected with the drop of the
> > audit container identifier by the last process that was in that
> > container in patch 5/16.  The production of this record is contingent on
> > the last ref in a contobj being dropped.  So if it is due to that ref
> > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> > record it fetched, then it will appear that the fetch action closed the
> > container rather than the last process in the container to exit.
> >
> > Does this make sense?
> 
> More so than your original reply, at least to me anyway.
> 
> It makes sense that the audit container ID wouldn't be marked as
> "dead" since it would still be very much alive and available for use
> by the orchestrator, the question is if that is desirable or not.  I
> think the answer to this comes down the preserving the correctness of
> the audit log.
> 
> If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
> reused then I think there is a legitimate concern that the audit log
> is not correct, and could be misleading.  If we solve that by grabbing
> an extra reference, then there could also be some confusion as
> userspace considers a container to be "dead" while the audit container
> ID still exists in the kernel, and the kernel generated audit
> container ID death record will not be generated until much later (and
> possibly be associated with a different event, but that could be
> solved by unassociating the container death record).

How does syscall association of the death record with AUDIT_SIGNAL_INFO2
possibly get associated with another event?  Or is the syscall
association with the fetch for the AUDIT_SIGNAL_INFO2 the other event?

Another idea might be to bump the refcount in audit_signal_info() but
mark tht contid as dead so it can't be reused if we are concerned that
the dead contid be reused?

There is still the problem later that the reported contid is incomplete
compared to the rest of the contid reporting cycle wrt nesting since
AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length
fields to accommodate a nested contid list.

>  Of the two
> approaches, I think the latter is safer in that it preserves the
> correctness of the audit log, even though it could result in a delay
> of the container death record.

I prefer the former since it strongly indicates last task in the
container.  The AUDIT_SIGNAL_INFO2 msg has the pid and other subject
attributes and the contid to strongly link the responsible party.

> Neither way is perfect, so if you have any other ideas I'm all ears.
> 
> > > (We do need to do some extra housekeeping in audit_signal_info() to
> > > deal with the case where nobody asks for AUDIT_SIGNAL_INFO2 -
> > > basically if audit_sig_cid is not NULL we should drop a reference
> > > before assigning it a new object pointer, and of course we would need
> > > to set audit_sig_cid to NULL in audit_receive_msg() after sending it
> > > up to userspace and dropping the extra ref.)
> 
> -- 
> paul moore
> www.paul-moore.com
> 

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore Feb. 5, 2020, 10:50 p.m. UTC | #7
On Tue, Feb 4, 2020 at 6:15 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-01-23 16:35, Paul Moore wrote:
> > On Thu, Jan 23, 2020 at 3:04 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2020-01-23 12:09, Paul Moore wrote:
> > > > On Thu, Jan 23, 2020 at 11:29 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > On 2020-01-22 16:28, Paul Moore wrote:
> > > > > > On Tue, Dec 31, 2019 at 2:50 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > >
> > > > > > > Add audit container identifier support to the action of signalling the
> > > > > > > audit daemon.
> > > > > > >
> > > > > > > Since this would need to add an element to the audit_sig_info struct,
> > > > > > > a new record type AUDIT_SIGNAL_INFO2 was created with a new
> > > > > > > audit_sig_info2 struct.  Corresponding support is required in the
> > > > > > > userspace code to reflect the new record request and reply type.
> > > > > > > An older userspace won't break since it won't know to request this
> > > > > > > record type.
> > > > > > >
> > > > > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > > > > > ---
> > > > > > >  include/linux/audit.h       |  7 +++++++
> > > > > > >  include/uapi/linux/audit.h  |  1 +
> > > > > > >  kernel/audit.c              | 35 +++++++++++++++++++++++++++++++++++
> > > > > > >  kernel/audit.h              |  1 +
> > > > > > >  security/selinux/nlmsgtab.c |  1 +
> > > > > > >  5 files changed, 45 insertions(+)
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > > diff --git a/kernel/audit.c b/kernel/audit.c
> > > > > > > index 0871c3e5d6df..51159c94041c 100644
> > > > > > > --- a/kernel/audit.c
> > > > > > > +++ b/kernel/audit.c
> > > > > > > @@ -126,6 +126,14 @@ struct auditd_connection {
> > > > > > >  kuid_t         audit_sig_uid = INVALID_UID;
> > > > > > >  pid_t          audit_sig_pid = -1;
> > > > > > >  u32            audit_sig_sid = 0;
> > > > > > > +/* Since the signal information is stored in the record buffer at the
> > > > > > > + * time of the signal, but not retrieved until later, there is a chance
> > > > > > > + * that the last process in the container could terminate before the
> > > > > > > + * signal record is delivered.  In this circumstance, there is a chance
> > > > > > > + * the orchestrator could reuse the audit container identifier, causing
> > > > > > > + * an overlap of audit records that refer to the same audit container
> > > > > > > + * identifier, but a different container instance.  */
> > > > > > > +u64            audit_sig_cid = AUDIT_CID_UNSET;
> > > > > >
> > > > > > I believe we could prevent the case mentioned above by taking an
> > > > > > additional reference to the audit container ID object when the signal
> > > > > > information is collected, dropping it only after the signal
> > > > > > information is collected by userspace or another process signals the
> > > > > > audit daemon.  Yes, it would block that audit container ID from being
> > > > > > reused immediately, but since we are talking about one number out of
> > > > > > 2^64 that seems like a reasonable tradeoff.
> > > > >
> > > > > I had thought that through and should have been more explicit about that
> > > > > situation when I documented it.  We could do that, but then the syscall
> > > > > records would be connected with the call from auditd on shutdown to
> > > > > request that signal information, rather than the exit of that last
> > > > > process that was using that container.  This strikes me as misleading.
> > > > > Is that really what we want?
> > > >
> > > >  ???
> > > >
> > > > I think one of us is not understanding the other; maybe it's me, maybe
> > > > it's you, maybe it's both of us.
> > > >
> > > > Anyway, here is what I was trying to convey with my original comment
> > > > ... When we record the audit container ID in audit_signal_info() we
> > > > take an extra reference to the audit container ID object so that it
> > > > will not disappear (and get reused) until after we respond with an
> > > > AUDIT_SIGNAL_INFO2.  In audit_receive_msg() when we do the
> > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took in
> > > > audit_signal_info().  Unless I'm missing some other change you made,
> > > > this *shouldn't* affect the syscall records, all it does is preserve
> > > > the audit container ID object in the kernel's ACID store so it doesn't
> > > > get reused.
> > >
> > > This is exactly what I had understood.  I hadn't considered the extra
> > > details below in detail due to my original syscall concern, but they
> > > make sense.
> > >
> > > The syscall I refer to is the one connected with the drop of the
> > > audit container identifier by the last process that was in that
> > > container in patch 5/16.  The production of this record is contingent on
> > > the last ref in a contobj being dropped.  So if it is due to that ref
> > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> > > record it fetched, then it will appear that the fetch action closed the
> > > container rather than the last process in the container to exit.
> > >
> > > Does this make sense?
> >
> > More so than your original reply, at least to me anyway.
> >
> > It makes sense that the audit container ID wouldn't be marked as
> > "dead" since it would still be very much alive and available for use
> > by the orchestrator, the question is if that is desirable or not.  I
> > think the answer to this comes down the preserving the correctness of
> > the audit log.
> >
> > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
> > reused then I think there is a legitimate concern that the audit log
> > is not correct, and could be misleading.  If we solve that by grabbing
> > an extra reference, then there could also be some confusion as
> > userspace considers a container to be "dead" while the audit container
> > ID still exists in the kernel, and the kernel generated audit
> > container ID death record will not be generated until much later (and
> > possibly be associated with a different event, but that could be
> > solved by unassociating the container death record).
>
> How does syscall association of the death record with AUDIT_SIGNAL_INFO2
> possibly get associated with another event?  Or is the syscall
> association with the fetch for the AUDIT_SIGNAL_INFO2 the other event?

The issue is when does the audit container ID "die".  If it is when
the last task in the container exits, then the death record will be
associated when the task's exit.  If the audit container ID lives on
until the last reference of it in the audit logs, including the
SIGNAL_INFO2 message, the death record will be associated with the
related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on
the details of the syscalls/netlink.

> Another idea might be to bump the refcount in audit_signal_info() but
> mark tht contid as dead so it can't be reused if we are concerned that
> the dead contid be reused?

Ooof.  Yes, maybe, but that would be ugly.

> There is still the problem later that the reported contid is incomplete
> compared to the rest of the contid reporting cycle wrt nesting since
> AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length
> fields to accommodate a nested contid list.

Do we really care about the full nested audit container ID list in the
SIGNAL_INFO2 record?

> >  Of the two
> > approaches, I think the latter is safer in that it preserves the
> > correctness of the audit log, even though it could result in a delay
> > of the container death record.
>
> I prefer the former since it strongly indicates last task in the
> container.  The AUDIT_SIGNAL_INFO2 msg has the pid and other subject
> attributes and the contid to strongly link the responsible party.

Steve is the only one who really tracks the security certifications
that are relevant to audit, see what the certification requirements
have to say and we can revisit this.
Steve Grubb Feb. 12, 2020, 10:38 p.m. UTC | #8
On Wednesday, February 5, 2020 5:50:28 PM EST Paul Moore wrote:
> > > > > ... When we record the audit container ID in audit_signal_info() we
> > > > > take an extra reference to the audit container ID object so that it
> > > > > will not disappear (and get reused) until after we respond with an
> > > > > AUDIT_SIGNAL_INFO2.  In audit_receive_msg() when we do the
> > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took
> > > > > in
> > > > > audit_signal_info().  Unless I'm missing some other change you
> > > > > made,
> > > > > this *shouldn't* affect the syscall records, all it does is
> > > > > preserve
> > > > > the audit container ID object in the kernel's ACID store so it
> > > > > doesn't
> > > > > get reused.
> > > > 
> > > > This is exactly what I had understood.  I hadn't considered the extra
> > > > details below in detail due to my original syscall concern, but they
> > > > make sense.
> > > > 
> > > > The syscall I refer to is the one connected with the drop of the
> > > > audit container identifier by the last process that was in that
> > > > container in patch 5/16.  The production of this record is contingent
> > > > on
> > > > the last ref in a contobj being dropped.  So if it is due to that ref
> > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> > > > record it fetched, then it will appear that the fetch action closed
> > > > the
> > > > container rather than the last process in the container to exit.
> > > > 
> > > > Does this make sense?
> > > 
> > > More so than your original reply, at least to me anyway.
> > > 
> > > It makes sense that the audit container ID wouldn't be marked as
> > > "dead" since it would still be very much alive and available for use
> > > by the orchestrator, the question is if that is desirable or not.  I
> > > think the answer to this comes down the preserving the correctness of
> > > the audit log.
> > > 
> > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
> > > reused then I think there is a legitimate concern that the audit log
> > > is not correct, and could be misleading.  If we solve that by grabbing
> > > an extra reference, then there could also be some confusion as
> > > userspace considers a container to be "dead" while the audit container
> > > ID still exists in the kernel, and the kernel generated audit
> > > container ID death record will not be generated until much later (and
> > > possibly be associated with a different event, but that could be
> > > solved by unassociating the container death record).
> > 
> > How does syscall association of the death record with AUDIT_SIGNAL_INFO2
> > possibly get associated with another event?  Or is the syscall
> > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event?
> 
> The issue is when does the audit container ID "die".  If it is when
> the last task in the container exits, then the death record will be
> associated when the task's exit.  If the audit container ID lives on
> until the last reference of it in the audit logs, including the
> SIGNAL_INFO2 message, the death record will be associated with the
> related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on
> the details of the syscalls/netlink.
> 
> > Another idea might be to bump the refcount in audit_signal_info() but
> > mark tht contid as dead so it can't be reused if we are concerned that
> > the dead contid be reused?
> 
> Ooof.  Yes, maybe, but that would be ugly.
> 
> > There is still the problem later that the reported contid is incomplete
> > compared to the rest of the contid reporting cycle wrt nesting since
> > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length
> > fields to accommodate a nested contid list.
> 
> Do we really care about the full nested audit container ID list in the
> SIGNAL_INFO2 record?
> 
> > > Of the two
> > > approaches, I think the latter is safer in that it preserves the
> > > correctness of the audit log, even though it could result in a delay
> > > of the container death record.
> > 
> > I prefer the former since it strongly indicates last task in the
> > container.  The AUDIT_SIGNAL_INFO2 msg has the pid and other subject
> > attributes and the contid to strongly link the responsible party.
> 
> Steve is the only one who really tracks the security certifications
> that are relevant to audit, see what the certification requirements
> have to say and we can revisit this.

Sever Virtualization Protection Profile is the closest applicable standard

https://www.niap-ccevs.org/Profile/Info.cfm?PPID=408&id=408

It is silent on audit requirements for the lifecycle of a VM. I assume that 
all that is needed is what the orchestrator says its doing at the high level. 
So, if an orchestrator wants to shutdown a container, the orchestrator must 
log that intent and its results. In a similar fashion, systemd logs that it's 
killing a service and we don't actually hook the exit syscall of the service 
to record that.

Now, if a container was being used as a VPS, and it had a fully functioning 
userspace, it's own services, and its very own audit daemon, then in this 
case it would care who sent a signal to its auditd. The tenant of that 
container may have to comply with PCI-DSS or something else. It would log the 
audit service is being terminated and systemd would record that its tearing 
down the environment. The OS doesn't need to do anything.

-Steve
Paul Moore Feb. 13, 2020, 12:09 a.m. UTC | #9
On Wed, Feb 12, 2020 at 5:39 PM Steve Grubb <sgrubb@redhat.com> wrote:
> On Wednesday, February 5, 2020 5:50:28 PM EST Paul Moore wrote:
> > > > > > ... When we record the audit container ID in audit_signal_info() we
> > > > > > take an extra reference to the audit container ID object so that it
> > > > > > will not disappear (and get reused) until after we respond with an
> > > > > > AUDIT_SIGNAL_INFO2.  In audit_receive_msg() when we do the
> > > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took
> > > > > > in
> > > > > > audit_signal_info().  Unless I'm missing some other change you
> > > > > > made,
> > > > > > this *shouldn't* affect the syscall records, all it does is
> > > > > > preserve
> > > > > > the audit container ID object in the kernel's ACID store so it
> > > > > > doesn't
> > > > > > get reused.
> > > > >
> > > > > This is exactly what I had understood.  I hadn't considered the extra
> > > > > details below in detail due to my original syscall concern, but they
> > > > > make sense.
> > > > >
> > > > > The syscall I refer to is the one connected with the drop of the
> > > > > audit container identifier by the last process that was in that
> > > > > container in patch 5/16.  The production of this record is contingent
> > > > > on
> > > > > the last ref in a contobj being dropped.  So if it is due to that ref
> > > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> > > > > record it fetched, then it will appear that the fetch action closed
> > > > > the
> > > > > container rather than the last process in the container to exit.
> > > > >
> > > > > Does this make sense?
> > > >
> > > > More so than your original reply, at least to me anyway.
> > > >
> > > > It makes sense that the audit container ID wouldn't be marked as
> > > > "dead" since it would still be very much alive and available for use
> > > > by the orchestrator, the question is if that is desirable or not.  I
> > > > think the answer to this comes down the preserving the correctness of
> > > > the audit log.
> > > >
> > > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
> > > > reused then I think there is a legitimate concern that the audit log
> > > > is not correct, and could be misleading.  If we solve that by grabbing
> > > > an extra reference, then there could also be some confusion as
> > > > userspace considers a container to be "dead" while the audit container
> > > > ID still exists in the kernel, and the kernel generated audit
> > > > container ID death record will not be generated until much later (and
> > > > possibly be associated with a different event, but that could be
> > > > solved by unassociating the container death record).
> > >
> > > How does syscall association of the death record with AUDIT_SIGNAL_INFO2
> > > possibly get associated with another event?  Or is the syscall
> > > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event?
> >
> > The issue is when does the audit container ID "die".  If it is when
> > the last task in the container exits, then the death record will be
> > associated when the task's exit.  If the audit container ID lives on
> > until the last reference of it in the audit logs, including the
> > SIGNAL_INFO2 message, the death record will be associated with the
> > related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on
> > the details of the syscalls/netlink.
> >
> > > Another idea might be to bump the refcount in audit_signal_info() but
> > > mark tht contid as dead so it can't be reused if we are concerned that
> > > the dead contid be reused?
> >
> > Ooof.  Yes, maybe, but that would be ugly.
> >
> > > There is still the problem later that the reported contid is incomplete
> > > compared to the rest of the contid reporting cycle wrt nesting since
> > > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length
> > > fields to accommodate a nested contid list.
> >
> > Do we really care about the full nested audit container ID list in the
> > SIGNAL_INFO2 record?
> >
> > > > Of the two
> > > > approaches, I think the latter is safer in that it preserves the
> > > > correctness of the audit log, even though it could result in a delay
> > > > of the container death record.
> > >
> > > I prefer the former since it strongly indicates last task in the
> > > container.  The AUDIT_SIGNAL_INFO2 msg has the pid and other subject
> > > attributes and the contid to strongly link the responsible party.
> >
> > Steve is the only one who really tracks the security certifications
> > that are relevant to audit, see what the certification requirements
> > have to say and we can revisit this.
>
> Sever Virtualization Protection Profile is the closest applicable standard
>
> https://www.niap-ccevs.org/Profile/Info.cfm?PPID=408&id=408
>
> It is silent on audit requirements for the lifecycle of a VM. I assume that
> all that is needed is what the orchestrator says its doing at the high level.
> So, if an orchestrator wants to shutdown a container, the orchestrator must
> log that intent and its results. In a similar fashion, systemd logs that it's
> killing a service and we don't actually hook the exit syscall of the service
> to record that.
>
> Now, if a container was being used as a VPS, and it had a fully functioning
> userspace, it's own services, and its very own audit daemon, then in this
> case it would care who sent a signal to its auditd. The tenant of that
> container may have to comply with PCI-DSS or something else. It would log the
> audit service is being terminated and systemd would record that its tearing
> down the environment. The OS doesn't need to do anything.

This latter case is the case of interest here, since the host auditd
should only be killed from a process on the host itself, not a process
running in a container.  If we work under the assumption (and this may
be a break in our approach to not defining "container") that an auditd
instance is only ever signaled by a process with the same audit
container ID (ACID), is this really even an issue?  Right now it isn't
as even with this patchset we will still really only support one
auditd instance, presumably on the host, so this isn't a significant
concern.  Moving forward, once we add support for multiple auditd
instances we will likely need to move the signal info into
(potentially) s per-ACID struct, a struct whose lifetime would match
that of the associated container by definition; as the auditd
container died, the struct would die, the refcounts dropped, and any
ACID held only the signal info refcount would be dropped/killed.

However, making this assumption would mean that we are expecting a
"container" to provide some level of isolation such that processes
with a different audit container ID do not signal each other.  From a
practical perspective I think that fits with the most (all?)
definitions of "container", but I can't say that for certain.  In
those cases where the assumption is not correct and processes can
signal each other across audit container ID boundaries, perhaps it is
enough to explain that an audit container ID may not fully disappear
until it has been fetched with a SIGNAL_INFO2 message.
Paul Moore Feb. 13, 2020, 9:44 p.m. UTC | #10
This is a bit of a thread-hijack, and for that I apologize, but
another thought crossed my mind while thinking about this issue
further ... Once we support multiple auditd instances, including the
necessary record routing and duplication/multiple-sends (the host
always sees *everything*), we will likely need to find a way to "trim"
the audit container ID (ACID) lists we send in the records.  The
auditd instance running on the host/initns will always see everything,
so it will want the full container ACID list; however an auditd
instance running inside a container really should only see the ACIDs
of any child containers.

For example, imagine a system where the host has containers 1 and 2,
each running an auditd instance.  Inside container 1 there are
containers A and B.  Inside container 2 there are containers Y and Z.
If an audit event is generated in container Z, I would expect the
host's auditd to see a ACID list of "1,Z" but container 1's auditd
should only see an ACID list of "Z".  The auditd running in container
2 should not see the record at all (that will be relatively
straightforward).  Does that make sense?  Do we have the record
formats properly designed to handle this without too much problem (I'm
not entirely sure we do)?
Richard Guy Briggs March 12, 2020, 7:30 p.m. UTC | #11
On 2020-02-13 16:44, Paul Moore wrote:
> This is a bit of a thread-hijack, and for that I apologize, but
> another thought crossed my mind while thinking about this issue
> further ... Once we support multiple auditd instances, including the
> necessary record routing and duplication/multiple-sends (the host
> always sees *everything*), we will likely need to find a way to "trim"
> the audit container ID (ACID) lists we send in the records.  The
> auditd instance running on the host/initns will always see everything,
> so it will want the full container ACID list; however an auditd
> instance running inside a container really should only see the ACIDs
> of any child containers.

Agreed.  This should be easy to check and limit, preventing an auditd
from seeing any contid that is a parent of its own contid.

> For example, imagine a system where the host has containers 1 and 2,
> each running an auditd instance.  Inside container 1 there are
> containers A and B.  Inside container 2 there are containers Y and Z.
> If an audit event is generated in container Z, I would expect the
> host's auditd to see a ACID list of "1,Z" but container 1's auditd
> should only see an ACID list of "Z".  The auditd running in container
> 2 should not see the record at all (that will be relatively
> straightforward).  Does that make sense?  Do we have the record
> formats properly designed to handle this without too much problem (I'm
> not entirely sure we do)?

I completely agree and I believe we have record formats that are able to
handle this already.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Richard Guy Briggs March 12, 2020, 8:27 p.m. UTC | #12
On 2020-02-12 19:09, Paul Moore wrote:
> On Wed, Feb 12, 2020 at 5:39 PM Steve Grubb <sgrubb@redhat.com> wrote:
> > On Wednesday, February 5, 2020 5:50:28 PM EST Paul Moore wrote:
> > > > > > > ... When we record the audit container ID in audit_signal_info() we
> > > > > > > take an extra reference to the audit container ID object so that it
> > > > > > > will not disappear (and get reused) until after we respond with an
> > > > > > > AUDIT_SIGNAL_INFO2.  In audit_receive_msg() when we do the
> > > > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took
> > > > > > > in
> > > > > > > audit_signal_info().  Unless I'm missing some other change you
> > > > > > > made,
> > > > > > > this *shouldn't* affect the syscall records, all it does is
> > > > > > > preserve
> > > > > > > the audit container ID object in the kernel's ACID store so it
> > > > > > > doesn't
> > > > > > > get reused.
> > > > > >
> > > > > > This is exactly what I had understood.  I hadn't considered the extra
> > > > > > details below in detail due to my original syscall concern, but they
> > > > > > make sense.
> > > > > >
> > > > > > The syscall I refer to is the one connected with the drop of the
> > > > > > audit container identifier by the last process that was in that
> > > > > > container in patch 5/16.  The production of this record is contingent
> > > > > > on
> > > > > > the last ref in a contobj being dropped.  So if it is due to that ref
> > > > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> > > > > > record it fetched, then it will appear that the fetch action closed
> > > > > > the
> > > > > > container rather than the last process in the container to exit.
> > > > > >
> > > > > > Does this make sense?
> > > > >
> > > > > More so than your original reply, at least to me anyway.
> > > > >
> > > > > It makes sense that the audit container ID wouldn't be marked as
> > > > > "dead" since it would still be very much alive and available for use
> > > > > by the orchestrator, the question is if that is desirable or not.  I
> > > > > think the answer to this comes down the preserving the correctness of
> > > > > the audit log.
> > > > >
> > > > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
> > > > > reused then I think there is a legitimate concern that the audit log
> > > > > is not correct, and could be misleading.  If we solve that by grabbing
> > > > > an extra reference, then there could also be some confusion as
> > > > > userspace considers a container to be "dead" while the audit container
> > > > > ID still exists in the kernel, and the kernel generated audit
> > > > > container ID death record will not be generated until much later (and
> > > > > possibly be associated with a different event, but that could be
> > > > > solved by unassociating the container death record).
> > > >
> > > > How does syscall association of the death record with AUDIT_SIGNAL_INFO2
> > > > possibly get associated with another event?  Or is the syscall
> > > > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event?
> > >
> > > The issue is when does the audit container ID "die".  If it is when
> > > the last task in the container exits, then the death record will be
> > > associated when the task's exit.  If the audit container ID lives on
> > > until the last reference of it in the audit logs, including the
> > > SIGNAL_INFO2 message, the death record will be associated with the
> > > related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on
> > > the details of the syscalls/netlink.
> > >
> > > > Another idea might be to bump the refcount in audit_signal_info() but
> > > > mark tht contid as dead so it can't be reused if we are concerned that
> > > > the dead contid be reused?
> > >
> > > Ooof.  Yes, maybe, but that would be ugly.
> > >
> > > > There is still the problem later that the reported contid is incomplete
> > > > compared to the rest of the contid reporting cycle wrt nesting since
> > > > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length
> > > > fields to accommodate a nested contid list.
> > >
> > > Do we really care about the full nested audit container ID list in the
> > > SIGNAL_INFO2 record?

I'm inclined to hand-wave it away as inconvenient that can be looked up
more carefully if it is really needed.  Maybe the block above would be
safer and more complete even though it is ugly.

> > > > > Of the two
> > > > > approaches, I think the latter is safer in that it preserves the
> > > > > correctness of the audit log, even though it could result in a delay
> > > > > of the container death record.
> > > >
> > > > I prefer the former since it strongly indicates last task in the
> > > > container.  The AUDIT_SIGNAL_INFO2 msg has the pid and other subject
> > > > attributes and the contid to strongly link the responsible party.
> > >
> > > Steve is the only one who really tracks the security certifications
> > > that are relevant to audit, see what the certification requirements
> > > have to say and we can revisit this.
> >
> > Sever Virtualization Protection Profile is the closest applicable standard
> >
> > https://www.niap-ccevs.org/Profile/Info.cfm?PPID=408&id=408
> >
> > It is silent on audit requirements for the lifecycle of a VM. I assume that
> > all that is needed is what the orchestrator says its doing at the high level.
> > So, if an orchestrator wants to shutdown a container, the orchestrator must
> > log that intent and its results. In a similar fashion, systemd logs that it's
> > killing a service and we don't actually hook the exit syscall of the service
> > to record that.
> >
> > Now, if a container was being used as a VPS, and it had a fully functioning
> > userspace, it's own services, and its very own audit daemon, then in this
> > case it would care who sent a signal to its auditd. The tenant of that
> > container may have to comply with PCI-DSS or something else. It would log the
> > audit service is being terminated and systemd would record that its tearing
> > down the environment. The OS doesn't need to do anything.
> 
> This latter case is the case of interest here, since the host auditd
> should only be killed from a process on the host itself, not a process
> running in a container.  If we work under the assumption (and this may
> be a break in our approach to not defining "container") that an auditd
> instance is only ever signaled by a process with the same audit
> container ID (ACID), is this really even an issue?  Right now it isn't
> as even with this patchset we will still really only support one
> auditd instance, presumably on the host, so this isn't a significant
> concern.  Moving forward, once we add support for multiple auditd
> instances we will likely need to move the signal info into
> (potentially) s per-ACID struct, a struct whose lifetime would match
> that of the associated container by definition; as the auditd
> container died, the struct would die, the refcounts dropped, and any
> ACID held only the signal info refcount would be dropped/killed.

Any process could signal auditd if it can see it based on namespace
relationships, nevermind container placement.  Some container
architectures would not have a namespace configuration that would block
this (combination of PID/user/IPC?).

> However, making this assumption would mean that we are expecting a
> "container" to provide some level of isolation such that processes
> with a different audit container ID do not signal each other.  From a
> practical perspective I think that fits with the most (all?)
> definitions of "container", but I can't say that for certain.  In
> those cases where the assumption is not correct and processes can
> signal each other across audit container ID boundaries, perhaps it is
> enough to explain that an audit container ID may not fully disappear
> until it has been fetched with a SIGNAL_INFO2 message.

I think more and more, that more complete isolation is being done,
taking advantage of each type of namespace as they become available, but
I know a nuber of them didn't find it important yet to use IPC, PID or
user namespaces which would be the only namespaces I can think of that
would provide that isolation.

It isn't entirely clear to me which side you fall on this issue, Paul.
Can you pronounce on your strong preference one way or the other if the
death of a container coincide with the exit of the last process in that
namespace, or the fetch of any signal info related to it?  I have a bias
to the former since the code already does that and I feel the exit of
the last process is much more relevant supported by the syscall record,
but could change it to the latter if you feel strongly enough about it
to block upstream acceptance.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore March 13, 2020, 4:29 p.m. UTC | #13
On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-02-13 16:44, Paul Moore wrote:
> > This is a bit of a thread-hijack, and for that I apologize, but
> > another thought crossed my mind while thinking about this issue
> > further ... Once we support multiple auditd instances, including the
> > necessary record routing and duplication/multiple-sends (the host
> > always sees *everything*), we will likely need to find a way to "trim"
> > the audit container ID (ACID) lists we send in the records.  The
> > auditd instance running on the host/initns will always see everything,
> > so it will want the full container ACID list; however an auditd
> > instance running inside a container really should only see the ACIDs
> > of any child containers.
>
> Agreed.  This should be easy to check and limit, preventing an auditd
> from seeing any contid that is a parent of its own contid.
>
> > For example, imagine a system where the host has containers 1 and 2,
> > each running an auditd instance.  Inside container 1 there are
> > containers A and B.  Inside container 2 there are containers Y and Z.
> > If an audit event is generated in container Z, I would expect the
> > host's auditd to see a ACID list of "1,Z" but container 1's auditd
> > should only see an ACID list of "Z".  The auditd running in container
> > 2 should not see the record at all (that will be relatively
> > straightforward).  Does that make sense?  Do we have the record
> > formats properly designed to handle this without too much problem (I'm
> > not entirely sure we do)?
>
> I completely agree and I believe we have record formats that are able to
> handle this already.

I'm not convinced we do.  What about the cases where we have a field
with a list of audit container IDs?  How do we handle that?
Paul Moore March 13, 2020, 4:42 p.m. UTC | #14
On Thu, Mar 12, 2020 at 4:27 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-02-12 19:09, Paul Moore wrote:
> > On Wed, Feb 12, 2020 at 5:39 PM Steve Grubb <sgrubb@redhat.com> wrote:
> > > On Wednesday, February 5, 2020 5:50:28 PM EST Paul Moore wrote:
> > > > > > > > ... When we record the audit container ID in audit_signal_info() we
> > > > > > > > take an extra reference to the audit container ID object so that it
> > > > > > > > will not disappear (and get reused) until after we respond with an
> > > > > > > > AUDIT_SIGNAL_INFO2.  In audit_receive_msg() when we do the
> > > > > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took
> > > > > > > > in
> > > > > > > > audit_signal_info().  Unless I'm missing some other change you
> > > > > > > > made,
> > > > > > > > this *shouldn't* affect the syscall records, all it does is
> > > > > > > > preserve
> > > > > > > > the audit container ID object in the kernel's ACID store so it
> > > > > > > > doesn't
> > > > > > > > get reused.
> > > > > > >
> > > > > > > This is exactly what I had understood.  I hadn't considered the extra
> > > > > > > details below in detail due to my original syscall concern, but they
> > > > > > > make sense.
> > > > > > >
> > > > > > > The syscall I refer to is the one connected with the drop of the
> > > > > > > audit container identifier by the last process that was in that
> > > > > > > container in patch 5/16.  The production of this record is contingent
> > > > > > > on
> > > > > > > the last ref in a contobj being dropped.  So if it is due to that ref
> > > > > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> > > > > > > record it fetched, then it will appear that the fetch action closed
> > > > > > > the
> > > > > > > container rather than the last process in the container to exit.
> > > > > > >
> > > > > > > Does this make sense?
> > > > > >
> > > > > > More so than your original reply, at least to me anyway.
> > > > > >
> > > > > > It makes sense that the audit container ID wouldn't be marked as
> > > > > > "dead" since it would still be very much alive and available for use
> > > > > > by the orchestrator, the question is if that is desirable or not.  I
> > > > > > think the answer to this comes down the preserving the correctness of
> > > > > > the audit log.
> > > > > >
> > > > > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
> > > > > > reused then I think there is a legitimate concern that the audit log
> > > > > > is not correct, and could be misleading.  If we solve that by grabbing
> > > > > > an extra reference, then there could also be some confusion as
> > > > > > userspace considers a container to be "dead" while the audit container
> > > > > > ID still exists in the kernel, and the kernel generated audit
> > > > > > container ID death record will not be generated until much later (and
> > > > > > possibly be associated with a different event, but that could be
> > > > > > solved by unassociating the container death record).
> > > > >
> > > > > How does syscall association of the death record with AUDIT_SIGNAL_INFO2
> > > > > possibly get associated with another event?  Or is the syscall
> > > > > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event?
> > > >
> > > > The issue is when does the audit container ID "die".  If it is when
> > > > the last task in the container exits, then the death record will be
> > > > associated when the task's exit.  If the audit container ID lives on
> > > > until the last reference of it in the audit logs, including the
> > > > SIGNAL_INFO2 message, the death record will be associated with the
> > > > related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on
> > > > the details of the syscalls/netlink.
> > > >
> > > > > Another idea might be to bump the refcount in audit_signal_info() but
> > > > > mark tht contid as dead so it can't be reused if we are concerned that
> > > > > the dead contid be reused?
> > > >
> > > > Ooof.  Yes, maybe, but that would be ugly.
> > > >
> > > > > There is still the problem later that the reported contid is incomplete
> > > > > compared to the rest of the contid reporting cycle wrt nesting since
> > > > > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length
> > > > > fields to accommodate a nested contid list.
> > > >
> > > > Do we really care about the full nested audit container ID list in the
> > > > SIGNAL_INFO2 record?
>
> I'm inclined to hand-wave it away as inconvenient that can be looked up
> more carefully if it is really needed.  Maybe the block above would be
> safer and more complete even though it is ugly.
>
> > > > > > Of the two
> > > > > > approaches, I think the latter is safer in that it preserves the
> > > > > > correctness of the audit log, even though it could result in a delay
> > > > > > of the container death record.
> > > > >
> > > > > I prefer the former since it strongly indicates last task in the
> > > > > container.  The AUDIT_SIGNAL_INFO2 msg has the pid and other subject
> > > > > attributes and the contid to strongly link the responsible party.
> > > >
> > > > Steve is the only one who really tracks the security certifications
> > > > that are relevant to audit, see what the certification requirements
> > > > have to say and we can revisit this.
> > >
> > > Sever Virtualization Protection Profile is the closest applicable standard
> > >
> > > https://www.niap-ccevs.org/Profile/Info.cfm?PPID=408&id=408
> > >
> > > It is silent on audit requirements for the lifecycle of a VM. I assume that
> > > all that is needed is what the orchestrator says its doing at the high level.
> > > So, if an orchestrator wants to shutdown a container, the orchestrator must
> > > log that intent and its results. In a similar fashion, systemd logs that it's
> > > killing a service and we don't actually hook the exit syscall of the service
> > > to record that.
> > >
> > > Now, if a container was being used as a VPS, and it had a fully functioning
> > > userspace, it's own services, and its very own audit daemon, then in this
> > > case it would care who sent a signal to its auditd. The tenant of that
> > > container may have to comply with PCI-DSS or something else. It would log the
> > > audit service is being terminated and systemd would record that its tearing
> > > down the environment. The OS doesn't need to do anything.
> >
> > This latter case is the case of interest here, since the host auditd
> > should only be killed from a process on the host itself, not a process
> > running in a container.  If we work under the assumption (and this may
> > be a break in our approach to not defining "container") that an auditd
> > instance is only ever signaled by a process with the same audit
> > container ID (ACID), is this really even an issue?  Right now it isn't
> > as even with this patchset we will still really only support one
> > auditd instance, presumably on the host, so this isn't a significant
> > concern.  Moving forward, once we add support for multiple auditd
> > instances we will likely need to move the signal info into
> > (potentially) s per-ACID struct, a struct whose lifetime would match
> > that of the associated container by definition; as the auditd
> > container died, the struct would die, the refcounts dropped, and any
> > ACID held only the signal info refcount would be dropped/killed.
>
> Any process could signal auditd if it can see it based on namespace
> relationships, nevermind container placement.  Some container
> architectures would not have a namespace configuration that would block
> this (combination of PID/user/IPC?).
>
> > However, making this assumption would mean that we are expecting a
> > "container" to provide some level of isolation such that processes
> > with a different audit container ID do not signal each other.  From a
> > practical perspective I think that fits with the most (all?)
> > definitions of "container", but I can't say that for certain.  In
> > those cases where the assumption is not correct and processes can
> > signal each other across audit container ID boundaries, perhaps it is
> > enough to explain that an audit container ID may not fully disappear
> > until it has been fetched with a SIGNAL_INFO2 message.
>
> I think more and more, that more complete isolation is being done,
> taking advantage of each type of namespace as they become available, but
> I know a nuber of them didn't find it important yet to use IPC, PID or
> user namespaces which would be the only namespaces I can think of that
> would provide that isolation.
>
> It isn't entirely clear to me which side you fall on this issue, Paul.

That's mostly because I was hoping for some clarification in the
discussion, especially the relevant certification requirements, but it
looks like there is still plenty of room for interpretation there (as
usual).  I'd much rather us arrive at decisions based on requirements
and not gut feelings, which is where I think we are at right now.

> Can you pronounce on your strong preference one way or the other if the
> death of a container coincide with the exit of the last process in that
> namespace, or the fetch of any signal info related to it?

"pronounce on your strong preference"?  I've seen you use "pronounce"
a few times now, and suggest a different word in the future; the
connotation is not well received on my end.

> I have a bias
> to the former since the code already does that and I feel the exit of
> the last process is much more relevant supported by the syscall record,
> but could change it to the latter if you feel strongly enough about it
> to block upstream acceptance.

At this point in time I believe the right thing to do is to preserve
the audit container ID as "dead but still in existence" so that there
is no confusion (due to reuse) if/when it finally reappears in the
audit record stream.

The thread has had a lot of starts/stops, so I may be repeating a
previous suggestion, but one idea would be to still emit a "death
record" when the final task in the audit container ID does die, but
block the particular audit container ID from reuse until it the
SIGNAL2 info has been reported.  This gives us the timely ACID death
notification while still preventing confusion and ambiguity caused by
potentially reusing the ACID before the SIGNAL2 record has been sent;
there is a small nit about the ACID being present in the SIGNAL2
*after* its death, but I think that can be easily explained and
understood by admins.
Steve Grubb March 13, 2020, 4:45 p.m. UTC | #15
On Friday, March 13, 2020 12:42:15 PM EDT Paul Moore wrote:
> > I think more and more, that more complete isolation is being done,
> > taking advantage of each type of namespace as they become available, but
> > I know a nuber of them didn't find it important yet to use IPC, PID or
> > user namespaces which would be the only namespaces I can think of that
> > would provide that isolation.
> > 
> > It isn't entirely clear to me which side you fall on this issue, Paul.
> 
> That's mostly because I was hoping for some clarification in the
> discussion, especially the relevant certification requirements, but it
> looks like there is still plenty of room for interpretation there (as
> usual).  I'd much rather us arrive at decisions based on requirements
> and not gut feelings, which is where I think we are at right now.

Certification rquirements are that we need the identity of anyone attempting 
to modify the audit configuration including shutting it down.

-Steve
Paul Moore March 13, 2020, 4:49 p.m. UTC | #16
On Fri, Mar 13, 2020 at 12:45 PM Steve Grubb <sgrubb@redhat.com> wrote:
> On Friday, March 13, 2020 12:42:15 PM EDT Paul Moore wrote:
> > > I think more and more, that more complete isolation is being done,
> > > taking advantage of each type of namespace as they become available, but
> > > I know a nuber of them didn't find it important yet to use IPC, PID or
> > > user namespaces which would be the only namespaces I can think of that
> > > would provide that isolation.
> > >
> > > It isn't entirely clear to me which side you fall on this issue, Paul.
> >
> > That's mostly because I was hoping for some clarification in the
> > discussion, especially the relevant certification requirements, but it
> > looks like there is still plenty of room for interpretation there (as
> > usual).  I'd much rather us arrive at decisions based on requirements
> > and not gut feelings, which is where I think we are at right now.
>
> Certification rquirements are that we need the identity of anyone attempting
> to modify the audit configuration including shutting it down.

Yep, got it.  Unfortunately that doesn't really help with what we are
talking about.  Although preventing the reuse of the ACID before the
SIGNAL2 record does help preserve the sanity of the audit stream which
I believe to be very important, regardless.
Richard Guy Briggs March 13, 2020, 6:59 p.m. UTC | #17
On 2020-03-13 12:29, Paul Moore wrote:
> On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-02-13 16:44, Paul Moore wrote:
> > > This is a bit of a thread-hijack, and for that I apologize, but
> > > another thought crossed my mind while thinking about this issue
> > > further ... Once we support multiple auditd instances, including the
> > > necessary record routing and duplication/multiple-sends (the host
> > > always sees *everything*), we will likely need to find a way to "trim"
> > > the audit container ID (ACID) lists we send in the records.  The
> > > auditd instance running on the host/initns will always see everything,
> > > so it will want the full container ACID list; however an auditd
> > > instance running inside a container really should only see the ACIDs
> > > of any child containers.
> >
> > Agreed.  This should be easy to check and limit, preventing an auditd
> > from seeing any contid that is a parent of its own contid.
> >
> > > For example, imagine a system where the host has containers 1 and 2,
> > > each running an auditd instance.  Inside container 1 there are
> > > containers A and B.  Inside container 2 there are containers Y and Z.
> > > If an audit event is generated in container Z, I would expect the
> > > host's auditd to see a ACID list of "1,Z" but container 1's auditd
> > > should only see an ACID list of "Z".  The auditd running in container
> > > 2 should not see the record at all (that will be relatively
> > > straightforward).  Does that make sense?  Do we have the record
> > > formats properly designed to handle this without too much problem (I'm
> > > not entirely sure we do)?
> >
> > I completely agree and I believe we have record formats that are able to
> > handle this already.
> 
> I'm not convinced we do.  What about the cases where we have a field
> with a list of audit container IDs?  How do we handle that?

I don't understand the problem.  (I think you crossed your 1/2 vs
A/B/Y/Z in your example.)  Clarifying the example above, if as you
suggest an event happens in container Z, the hosts's auditd would report
	Z,^2
and the auditd in container 2 would report
	Z,^2
but if there were another auditd running in container Z it would report
	Z
while the auditd in container 1 or A/B would see nothing.

The format I had proposed already handles that:
contid^contid,contid^contid but you'd like to see it changed to
contid,^contid,contid,^contid and both formats handle it though I find
the former much easier to read.  For the example above we'd have:
	A,^1
	B,^1
	Y,^2
	Z,^2
and for a shared network namespace potentially:
	A,^1,B,^1,Y,^2,Z,^2
and if there were an event reported by an auditd in container Z it would
report only:
	Z

Now, I could see an argument for restricting the visibility of the
contid to the container containing an auditd so that an auditd cannot
see its own contid, but that wasn't my design intent.  This can still be
addressed after the initial code is committed without breaking the API.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Richard Guy Briggs March 13, 2020, 7:23 p.m. UTC | #18
On 2020-03-13 12:42, Paul Moore wrote:
> On Thu, Mar 12, 2020 at 4:27 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-02-12 19:09, Paul Moore wrote:
> > > On Wed, Feb 12, 2020 at 5:39 PM Steve Grubb <sgrubb@redhat.com> wrote:
> > > > On Wednesday, February 5, 2020 5:50:28 PM EST Paul Moore wrote:
> > > > > > > > > ... When we record the audit container ID in audit_signal_info() we
> > > > > > > > > take an extra reference to the audit container ID object so that it
> > > > > > > > > will not disappear (and get reused) until after we respond with an
> > > > > > > > > AUDIT_SIGNAL_INFO2.  In audit_receive_msg() when we do the
> > > > > > > > > AUDIT_SIGNAL_INFO2 processing we drop the extra reference we took
> > > > > > > > > in
> > > > > > > > > audit_signal_info().  Unless I'm missing some other change you
> > > > > > > > > made,
> > > > > > > > > this *shouldn't* affect the syscall records, all it does is
> > > > > > > > > preserve
> > > > > > > > > the audit container ID object in the kernel's ACID store so it
> > > > > > > > > doesn't
> > > > > > > > > get reused.
> > > > > > > >
> > > > > > > > This is exactly what I had understood.  I hadn't considered the extra
> > > > > > > > details below in detail due to my original syscall concern, but they
> > > > > > > > make sense.
> > > > > > > >
> > > > > > > > The syscall I refer to is the one connected with the drop of the
> > > > > > > > audit container identifier by the last process that was in that
> > > > > > > > container in patch 5/16.  The production of this record is contingent
> > > > > > > > on
> > > > > > > > the last ref in a contobj being dropped.  So if it is due to that ref
> > > > > > > > being maintained by audit_signal_info() until the AUDIT_SIGNAL_INFO2
> > > > > > > > record it fetched, then it will appear that the fetch action closed
> > > > > > > > the
> > > > > > > > container rather than the last process in the container to exit.
> > > > > > > >
> > > > > > > > Does this make sense?
> > > > > > >
> > > > > > > More so than your original reply, at least to me anyway.
> > > > > > >
> > > > > > > It makes sense that the audit container ID wouldn't be marked as
> > > > > > > "dead" since it would still be very much alive and available for use
> > > > > > > by the orchestrator, the question is if that is desirable or not.  I
> > > > > > > think the answer to this comes down the preserving the correctness of
> > > > > > > the audit log.
> > > > > > >
> > > > > > > If the audit container ID reported by AUDIT_SIGNAL_INFO2 has been
> > > > > > > reused then I think there is a legitimate concern that the audit log
> > > > > > > is not correct, and could be misleading.  If we solve that by grabbing
> > > > > > > an extra reference, then there could also be some confusion as
> > > > > > > userspace considers a container to be "dead" while the audit container
> > > > > > > ID still exists in the kernel, and the kernel generated audit
> > > > > > > container ID death record will not be generated until much later (and
> > > > > > > possibly be associated with a different event, but that could be
> > > > > > > solved by unassociating the container death record).
> > > > > >
> > > > > > How does syscall association of the death record with AUDIT_SIGNAL_INFO2
> > > > > > possibly get associated with another event?  Or is the syscall
> > > > > > association with the fetch for the AUDIT_SIGNAL_INFO2 the other event?
> > > > >
> > > > > The issue is when does the audit container ID "die".  If it is when
> > > > > the last task in the container exits, then the death record will be
> > > > > associated when the task's exit.  If the audit container ID lives on
> > > > > until the last reference of it in the audit logs, including the
> > > > > SIGNAL_INFO2 message, the death record will be associated with the
> > > > > related SIGNAL_INFO2 syscalls, or perhaps unassociated depending on
> > > > > the details of the syscalls/netlink.
> > > > >
> > > > > > Another idea might be to bump the refcount in audit_signal_info() but
> > > > > > mark tht contid as dead so it can't be reused if we are concerned that
> > > > > > the dead contid be reused?
> > > > >
> > > > > Ooof.  Yes, maybe, but that would be ugly.
> > > > >
> > > > > > There is still the problem later that the reported contid is incomplete
> > > > > > compared to the rest of the contid reporting cycle wrt nesting since
> > > > > > AUDIT_SIGNAL_INFO2 will need to be more complex w/2 variable length
> > > > > > fields to accommodate a nested contid list.
> > > > >
> > > > > Do we really care about the full nested audit container ID list in the
> > > > > SIGNAL_INFO2 record?
> >
> > I'm inclined to hand-wave it away as inconvenient that can be looked up
> > more carefully if it is really needed.  Maybe the block above would be
> > safer and more complete even though it is ugly.
> >
> > > > > > > Of the two
> > > > > > > approaches, I think the latter is safer in that it preserves the
> > > > > > > correctness of the audit log, even though it could result in a delay
> > > > > > > of the container death record.
> > > > > >
> > > > > > I prefer the former since it strongly indicates last task in the
> > > > > > container.  The AUDIT_SIGNAL_INFO2 msg has the pid and other subject
> > > > > > attributes and the contid to strongly link the responsible party.
> > > > >
> > > > > Steve is the only one who really tracks the security certifications
> > > > > that are relevant to audit, see what the certification requirements
> > > > > have to say and we can revisit this.
> > > >
> > > > Sever Virtualization Protection Profile is the closest applicable standard
> > > >
> > > > https://www.niap-ccevs.org/Profile/Info.cfm?PPID=408&id=408
> > > >
> > > > It is silent on audit requirements for the lifecycle of a VM. I assume that
> > > > all that is needed is what the orchestrator says its doing at the high level.
> > > > So, if an orchestrator wants to shutdown a container, the orchestrator must
> > > > log that intent and its results. In a similar fashion, systemd logs that it's
> > > > killing a service and we don't actually hook the exit syscall of the service
> > > > to record that.
> > > >
> > > > Now, if a container was being used as a VPS, and it had a fully functioning
> > > > userspace, it's own services, and its very own audit daemon, then in this
> > > > case it would care who sent a signal to its auditd. The tenant of that
> > > > container may have to comply with PCI-DSS or something else. It would log the
> > > > audit service is being terminated and systemd would record that its tearing
> > > > down the environment. The OS doesn't need to do anything.
> > >
> > > This latter case is the case of interest here, since the host auditd
> > > should only be killed from a process on the host itself, not a process
> > > running in a container.  If we work under the assumption (and this may
> > > be a break in our approach to not defining "container") that an auditd
> > > instance is only ever signaled by a process with the same audit
> > > container ID (ACID), is this really even an issue?  Right now it isn't
> > > as even with this patchset we will still really only support one
> > > auditd instance, presumably on the host, so this isn't a significant
> > > concern.  Moving forward, once we add support for multiple auditd
> > > instances we will likely need to move the signal info into
> > > (potentially) s per-ACID struct, a struct whose lifetime would match
> > > that of the associated container by definition; as the auditd
> > > container died, the struct would die, the refcounts dropped, and any
> > > ACID held only the signal info refcount would be dropped/killed.
> >
> > Any process could signal auditd if it can see it based on namespace
> > relationships, nevermind container placement.  Some container
> > architectures would not have a namespace configuration that would block
> > this (combination of PID/user/IPC?).
> >
> > > However, making this assumption would mean that we are expecting a
> > > "container" to provide some level of isolation such that processes
> > > with a different audit container ID do not signal each other.  From a
> > > practical perspective I think that fits with the most (all?)
> > > definitions of "container", but I can't say that for certain.  In
> > > those cases where the assumption is not correct and processes can
> > > signal each other across audit container ID boundaries, perhaps it is
> > > enough to explain that an audit container ID may not fully disappear
> > > until it has been fetched with a SIGNAL_INFO2 message.
> >
> > I think more and more, that more complete isolation is being done,
> > taking advantage of each type of namespace as they become available, but
> > I know a nuber of them didn't find it important yet to use IPC, PID or
> > user namespaces which would be the only namespaces I can think of that
> > would provide that isolation.
> >
> > It isn't entirely clear to me which side you fall on this issue, Paul.
> 
> That's mostly because I was hoping for some clarification in the
> discussion, especially the relevant certification requirements, but it
> looks like there is still plenty of room for interpretation there (as
> usual).  I'd much rather us arrive at decisions based on requirements
> and not gut feelings, which is where I think we are at right now.

I don't disagree.

> > Can you pronounce on your strong preference one way or the other if the
> > death of a container coincide with the exit of the last process in that
> > namespace, or the fetch of any signal info related to it?
> 
> "pronounce on your strong preference"?  I've seen you use "pronounce"
> a few times now, and suggest a different word in the future; the
> connotation is not well received on my end.

I'm sorry.  I don't have any particular attachment to that word, but
I'll try to be concious to avoid it since you've expressed your aversion
to it.  I don't mean to load it down with any negative connotations, I'm
simply seeking clarity on your preferred technical style so I may follow
it.

> > I have a bias
> > to the former since the code already does that and I feel the exit of
> > the last process is much more relevant supported by the syscall record,
> > but could change it to the latter if you feel strongly enough about it
> > to block upstream acceptance.
> 
> At this point in time I believe the right thing to do is to preserve
> the audit container ID as "dead but still in existence" so that there
> is no confusion (due to reuse) if/when it finally reappears in the
> audit record stream.

I agree this seems safest.

> The thread has had a lot of starts/stops, so I may be repeating a
> previous suggestion, but one idea would be to still emit a "death
> record" when the final task in the audit container ID does die, but
> block the particular audit container ID from reuse until it the
> SIGNAL2 info has been reported.  This gives us the timely ACID death
> notification while still preventing confusion and ambiguity caused by
> potentially reusing the ACID before the SIGNAL2 record has been sent;
> there is a small nit about the ACID being present in the SIGNAL2
> *after* its death, but I think that can be easily explained and
> understood by admins.

Thinking quickly about possible technical solutions to this, maybe it
makes sense to have two counters on a contobj so that we know when the
last process in that container exits and can issue the death
certificate, but we still block reuse of it until all further references
to it have been resolved.  This will likely also make it possible to
report the full contid chain in SIGNAL2 records.  This will eliminate
some of the issues we are discussing with regards to passing a contobj
vs a contid to the audit_log_contid function, but won't eliminate them
all because there are still some contids that won't have an object
associated with them to make it impossible to look them up in the
contobj lists.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore March 18, 2020, 8:56 p.m. UTC | #19
On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-03-13 12:29, Paul Moore wrote:
> > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2020-02-13 16:44, Paul Moore wrote:
> > > > This is a bit of a thread-hijack, and for that I apologize, but
> > > > another thought crossed my mind while thinking about this issue
> > > > further ... Once we support multiple auditd instances, including the
> > > > necessary record routing and duplication/multiple-sends (the host
> > > > always sees *everything*), we will likely need to find a way to "trim"
> > > > the audit container ID (ACID) lists we send in the records.  The
> > > > auditd instance running on the host/initns will always see everything,
> > > > so it will want the full container ACID list; however an auditd
> > > > instance running inside a container really should only see the ACIDs
> > > > of any child containers.
> > >
> > > Agreed.  This should be easy to check and limit, preventing an auditd
> > > from seeing any contid that is a parent of its own contid.
> > >
> > > > For example, imagine a system where the host has containers 1 and 2,
> > > > each running an auditd instance.  Inside container 1 there are
> > > > containers A and B.  Inside container 2 there are containers Y and Z.
> > > > If an audit event is generated in container Z, I would expect the
> > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd
> > > > should only see an ACID list of "Z".  The auditd running in container
> > > > 2 should not see the record at all (that will be relatively
> > > > straightforward).  Does that make sense?  Do we have the record
> > > > formats properly designed to handle this without too much problem (I'm
> > > > not entirely sure we do)?
> > >
> > > I completely agree and I believe we have record formats that are able to
> > > handle this already.
> >
> > I'm not convinced we do.  What about the cases where we have a field
> > with a list of audit container IDs?  How do we handle that?
>
> I don't understand the problem.  (I think you crossed your 1/2 vs
> A/B/Y/Z in your example.) ...

It looks like I did, sorry about that.

> ... Clarifying the example above, if as you
> suggest an event happens in container Z, the hosts's auditd would report
>         Z,^2
> and the auditd in container 2 would report
>         Z,^2
> but if there were another auditd running in container Z it would report
>         Z
> while the auditd in container 1 or A/B would see nothing.

Yes.  My concern is how do we handle this to minimize duplicating and
rewriting the records?  It isn't so much about the format, although
the format is a side effect.
Paul Moore March 18, 2020, 9:01 p.m. UTC | #20
On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-03-13 12:42, Paul Moore wrote:

...

> > The thread has had a lot of starts/stops, so I may be repeating a
> > previous suggestion, but one idea would be to still emit a "death
> > record" when the final task in the audit container ID does die, but
> > block the particular audit container ID from reuse until it the
> > SIGNAL2 info has been reported.  This gives us the timely ACID death
> > notification while still preventing confusion and ambiguity caused by
> > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > there is a small nit about the ACID being present in the SIGNAL2
> > *after* its death, but I think that can be easily explained and
> > understood by admins.
>
> Thinking quickly about possible technical solutions to this, maybe it
> makes sense to have two counters on a contobj so that we know when the
> last process in that container exits and can issue the death
> certificate, but we still block reuse of it until all further references
> to it have been resolved.  This will likely also make it possible to
> report the full contid chain in SIGNAL2 records.  This will eliminate
> some of the issues we are discussing with regards to passing a contobj
> vs a contid to the audit_log_contid function, but won't eliminate them
> all because there are still some contids that won't have an object
> associated with them to make it impossible to look them up in the
> contobj lists.

I'm not sure you need a full second counter, I imagine a simple flag
would be okay.  I think you just something to indicate that this ACID
object is marked as "dead" but it still being held for sanity reasons
and should not be reused.
Richard Guy Briggs March 18, 2020, 9:26 p.m. UTC | #21
On 2020-03-18 16:56, Paul Moore wrote:
> On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-03-13 12:29, Paul Moore wrote:
> > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2020-02-13 16:44, Paul Moore wrote:
> > > > > This is a bit of a thread-hijack, and for that I apologize, but
> > > > > another thought crossed my mind while thinking about this issue
> > > > > further ... Once we support multiple auditd instances, including the
> > > > > necessary record routing and duplication/multiple-sends (the host
> > > > > always sees *everything*), we will likely need to find a way to "trim"
> > > > > the audit container ID (ACID) lists we send in the records.  The
> > > > > auditd instance running on the host/initns will always see everything,
> > > > > so it will want the full container ACID list; however an auditd
> > > > > instance running inside a container really should only see the ACIDs
> > > > > of any child containers.
> > > >
> > > > Agreed.  This should be easy to check and limit, preventing an auditd
> > > > from seeing any contid that is a parent of its own contid.
> > > >
> > > > > For example, imagine a system where the host has containers 1 and 2,
> > > > > each running an auditd instance.  Inside container 1 there are
> > > > > containers A and B.  Inside container 2 there are containers Y and Z.
> > > > > If an audit event is generated in container Z, I would expect the
> > > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd
> > > > > should only see an ACID list of "Z".  The auditd running in container
> > > > > 2 should not see the record at all (that will be relatively
> > > > > straightforward).  Does that make sense?  Do we have the record
> > > > > formats properly designed to handle this without too much problem (I'm
> > > > > not entirely sure we do)?
> > > >
> > > > I completely agree and I believe we have record formats that are able to
> > > > handle this already.
> > >
> > > I'm not convinced we do.  What about the cases where we have a field
> > > with a list of audit container IDs?  How do we handle that?
> >
> > I don't understand the problem.  (I think you crossed your 1/2 vs
> > A/B/Y/Z in your example.) ...
> 
> It looks like I did, sorry about that.
> 
> > ... Clarifying the example above, if as you
> > suggest an event happens in container Z, the hosts's auditd would report
> >         Z,^2
> > and the auditd in container 2 would report
> >         Z,^2
> > but if there were another auditd running in container Z it would report
> >         Z
> > while the auditd in container 1 or A/B would see nothing.
> 
> Yes.  My concern is how do we handle this to minimize duplicating and
> rewriting the records?  It isn't so much about the format, although
> the format is a side effect.

Are you talking about caching, or about divulging more information than
necessary or even information leaks?  Or even noticing that records that
need to be generated to two audit daemons share the same contid field
values and should be generated at the same time or information shared
between them?  I'd see any of these as optimizations that don't affect
the api.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Richard Guy Briggs March 18, 2020, 9:41 p.m. UTC | #22
On 2020-03-18 17:01, Paul Moore wrote:
> On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-03-13 12:42, Paul Moore wrote:
> 
> ...
> 
> > > The thread has had a lot of starts/stops, so I may be repeating a
> > > previous suggestion, but one idea would be to still emit a "death
> > > record" when the final task in the audit container ID does die, but
> > > block the particular audit container ID from reuse until it the
> > > SIGNAL2 info has been reported.  This gives us the timely ACID death
> > > notification while still preventing confusion and ambiguity caused by
> > > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > > there is a small nit about the ACID being present in the SIGNAL2
> > > *after* its death, but I think that can be easily explained and
> > > understood by admins.
> >
> > Thinking quickly about possible technical solutions to this, maybe it
> > makes sense to have two counters on a contobj so that we know when the
> > last process in that container exits and can issue the death
> > certificate, but we still block reuse of it until all further references
> > to it have been resolved.  This will likely also make it possible to
> > report the full contid chain in SIGNAL2 records.  This will eliminate
> > some of the issues we are discussing with regards to passing a contobj
> > vs a contid to the audit_log_contid function, but won't eliminate them
> > all because there are still some contids that won't have an object
> > associated with them to make it impossible to look them up in the
> > contobj lists.
> 
> I'm not sure you need a full second counter, I imagine a simple flag
> would be okay.  I think you just something to indicate that this ACID
> object is marked as "dead" but it still being held for sanity reasons
> and should not be reused.

Ok, I see your point.  This refcount can be changed to a flag easily
enough without change to the api if we can be sure that more than one
signal can't be delivered to the audit daemon *and* collected by sig2.
I'll have a more careful look at the audit daemon code to see if I can
determine this.

Steve, can you have a look and tell us if it is possible for the audit
daemon to make more than one signal_info (or signal_info2) record
request from the kernel after receiving a signal?


Another question occurs to me is that what if the audit daemon is sent a
signal and it cannot or will not collect the sig2 information from the
kernel (SIGKILL?)?  Does that audit container identifier remain dead
until reboot, or do we institute some other form of reaping, possibly
time-based?


> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore March 18, 2020, 9:42 p.m. UTC | #23
On Wed, Mar 18, 2020 at 5:27 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-03-18 16:56, Paul Moore wrote:
> > On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2020-03-13 12:29, Paul Moore wrote:
> > > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > On 2020-02-13 16:44, Paul Moore wrote:
> > > > > > This is a bit of a thread-hijack, and for that I apologize, but
> > > > > > another thought crossed my mind while thinking about this issue
> > > > > > further ... Once we support multiple auditd instances, including the
> > > > > > necessary record routing and duplication/multiple-sends (the host
> > > > > > always sees *everything*), we will likely need to find a way to "trim"
> > > > > > the audit container ID (ACID) lists we send in the records.  The
> > > > > > auditd instance running on the host/initns will always see everything,
> > > > > > so it will want the full container ACID list; however an auditd
> > > > > > instance running inside a container really should only see the ACIDs
> > > > > > of any child containers.
> > > > >
> > > > > Agreed.  This should be easy to check and limit, preventing an auditd
> > > > > from seeing any contid that is a parent of its own contid.
> > > > >
> > > > > > For example, imagine a system where the host has containers 1 and 2,
> > > > > > each running an auditd instance.  Inside container 1 there are
> > > > > > containers A and B.  Inside container 2 there are containers Y and Z.
> > > > > > If an audit event is generated in container Z, I would expect the
> > > > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd
> > > > > > should only see an ACID list of "Z".  The auditd running in container
> > > > > > 2 should not see the record at all (that will be relatively
> > > > > > straightforward).  Does that make sense?  Do we have the record
> > > > > > formats properly designed to handle this without too much problem (I'm
> > > > > > not entirely sure we do)?
> > > > >
> > > > > I completely agree and I believe we have record formats that are able to
> > > > > handle this already.
> > > >
> > > > I'm not convinced we do.  What about the cases where we have a field
> > > > with a list of audit container IDs?  How do we handle that?
> > >
> > > I don't understand the problem.  (I think you crossed your 1/2 vs
> > > A/B/Y/Z in your example.) ...
> >
> > It looks like I did, sorry about that.
> >
> > > ... Clarifying the example above, if as you
> > > suggest an event happens in container Z, the hosts's auditd would report
> > >         Z,^2
> > > and the auditd in container 2 would report
> > >         Z,^2
> > > but if there were another auditd running in container Z it would report
> > >         Z
> > > while the auditd in container 1 or A/B would see nothing.
> >
> > Yes.  My concern is how do we handle this to minimize duplicating and
> > rewriting the records?  It isn't so much about the format, although
> > the format is a side effect.
>
> Are you talking about caching, or about divulging more information than
> necessary or even information leaks?  Or even noticing that records that
> need to be generated to two audit daemons share the same contid field
> values and should be generated at the same time or information shared
> between them?  I'd see any of these as optimizations that don't affect
> the api.

Imagine a record is generated in a container which has more than one
auditd in it's ancestry that should receive this record, how do we
handle that without completely killing performance?  That's my
concern.  If you've already thought up a plan for this - excellent,
please share :)
Paul Moore March 18, 2020, 9:47 p.m. UTC | #24
On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-03-18 17:01, Paul Moore wrote:
> > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2020-03-13 12:42, Paul Moore wrote:
> >
> > ...
> >
> > > > The thread has had a lot of starts/stops, so I may be repeating a
> > > > previous suggestion, but one idea would be to still emit a "death
> > > > record" when the final task in the audit container ID does die, but
> > > > block the particular audit container ID from reuse until it the
> > > > SIGNAL2 info has been reported.  This gives us the timely ACID death
> > > > notification while still preventing confusion and ambiguity caused by
> > > > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > > > there is a small nit about the ACID being present in the SIGNAL2
> > > > *after* its death, but I think that can be easily explained and
> > > > understood by admins.
> > >
> > > Thinking quickly about possible technical solutions to this, maybe it
> > > makes sense to have two counters on a contobj so that we know when the
> > > last process in that container exits and can issue the death
> > > certificate, but we still block reuse of it until all further references
> > > to it have been resolved.  This will likely also make it possible to
> > > report the full contid chain in SIGNAL2 records.  This will eliminate
> > > some of the issues we are discussing with regards to passing a contobj
> > > vs a contid to the audit_log_contid function, but won't eliminate them
> > > all because there are still some contids that won't have an object
> > > associated with them to make it impossible to look them up in the
> > > contobj lists.
> >
> > I'm not sure you need a full second counter, I imagine a simple flag
> > would be okay.  I think you just something to indicate that this ACID
> > object is marked as "dead" but it still being held for sanity reasons
> > and should not be reused.
>
> Ok, I see your point.  This refcount can be changed to a flag easily
> enough without change to the api if we can be sure that more than one
> signal can't be delivered to the audit daemon *and* collected by sig2.
> I'll have a more careful look at the audit daemon code to see if I can
> determine this.

Maybe I'm not understanding your concern, but this isn't really
different than any of the other things we track for the auditd signal
sender, right?  If we are worried about multiple signals being sent
then it applies to everything, not just the audit container ID.

> Another question occurs to me is that what if the audit daemon is sent a
> signal and it cannot or will not collect the sig2 information from the
> kernel (SIGKILL?)?  Does that audit container identifier remain dead
> until reboot, or do we institute some other form of reaping, possibly
> time-based?

In order to preserve the integrity of the audit log that ACID value
would need to remain unavailable until the ACID which contains the
associated auditd is "dead" (no one can request the signal sender's
info if that container is dead).
Richard Guy Briggs March 18, 2020, 9:55 p.m. UTC | #25
On 2020-03-18 17:42, Paul Moore wrote:
> On Wed, Mar 18, 2020 at 5:27 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-03-18 16:56, Paul Moore wrote:
> > > On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2020-03-13 12:29, Paul Moore wrote:
> > > > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > On 2020-02-13 16:44, Paul Moore wrote:
> > > > > > > This is a bit of a thread-hijack, and for that I apologize, but
> > > > > > > another thought crossed my mind while thinking about this issue
> > > > > > > further ... Once we support multiple auditd instances, including the
> > > > > > > necessary record routing and duplication/multiple-sends (the host
> > > > > > > always sees *everything*), we will likely need to find a way to "trim"
> > > > > > > the audit container ID (ACID) lists we send in the records.  The
> > > > > > > auditd instance running on the host/initns will always see everything,
> > > > > > > so it will want the full container ACID list; however an auditd
> > > > > > > instance running inside a container really should only see the ACIDs
> > > > > > > of any child containers.
> > > > > >
> > > > > > Agreed.  This should be easy to check and limit, preventing an auditd
> > > > > > from seeing any contid that is a parent of its own contid.
> > > > > >
> > > > > > > For example, imagine a system where the host has containers 1 and 2,
> > > > > > > each running an auditd instance.  Inside container 1 there are
> > > > > > > containers A and B.  Inside container 2 there are containers Y and Z.
> > > > > > > If an audit event is generated in container Z, I would expect the
> > > > > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd
> > > > > > > should only see an ACID list of "Z".  The auditd running in container
> > > > > > > 2 should not see the record at all (that will be relatively
> > > > > > > straightforward).  Does that make sense?  Do we have the record
> > > > > > > formats properly designed to handle this without too much problem (I'm
> > > > > > > not entirely sure we do)?
> > > > > >
> > > > > > I completely agree and I believe we have record formats that are able to
> > > > > > handle this already.
> > > > >
> > > > > I'm not convinced we do.  What about the cases where we have a field
> > > > > with a list of audit container IDs?  How do we handle that?
> > > >
> > > > I don't understand the problem.  (I think you crossed your 1/2 vs
> > > > A/B/Y/Z in your example.) ...
> > >
> > > It looks like I did, sorry about that.
> > >
> > > > ... Clarifying the example above, if as you
> > > > suggest an event happens in container Z, the hosts's auditd would report
> > > >         Z,^2
> > > > and the auditd in container 2 would report
> > > >         Z,^2
> > > > but if there were another auditd running in container Z it would report
> > > >         Z
> > > > while the auditd in container 1 or A/B would see nothing.
> > >
> > > Yes.  My concern is how do we handle this to minimize duplicating and
> > > rewriting the records?  It isn't so much about the format, although
> > > the format is a side effect.
> >
> > Are you talking about caching, or about divulging more information than
> > necessary or even information leaks?  Or even noticing that records that
> > need to be generated to two audit daemons share the same contid field
> > values and should be generated at the same time or information shared
> > between them?  I'd see any of these as optimizations that don't affect
> > the api.
> 
> Imagine a record is generated in a container which has more than one
> auditd in it's ancestry that should receive this record, how do we
> handle that without completely killing performance?  That's my
> concern.  If you've already thought up a plan for this - excellent,
> please share :)

No, I haven't given that much thought other than the correctness and
security issues of making sure that each audit daemon is sufficiently
isolated to do its job but not jeopardize another audit domain.  Audit
already kills performance, according to some...

We currently won't have that problem since there can only be one so far.
Fixing and optimizing this is part of the next phase of the challenge of
adding a second audit daemon.

Let's work on correctness and reasonable efficiency for this phase and
not focus on a problem we don't yet have.  I wouldn't consider this
incurring technical debt at this point.

I could see cacheing a contid string from one starting point, but it may
be more work to search that cached string to truncate it or add to it
when another audit daemon requests a copy of a similar string.  I
suppose every full contid string could be generated the first time it is
used and parts of it used (start/finish) as needed but that
search/indexing may not be worth it.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore March 18, 2020, 10:06 p.m. UTC | #26
On Wed, Mar 18, 2020 at 5:56 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-03-18 17:42, Paul Moore wrote:
> > On Wed, Mar 18, 2020 at 5:27 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2020-03-18 16:56, Paul Moore wrote:
> > > > On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > On 2020-03-13 12:29, Paul Moore wrote:
> > > > > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > > On 2020-02-13 16:44, Paul Moore wrote:
> > > > > > > > This is a bit of a thread-hijack, and for that I apologize, but
> > > > > > > > another thought crossed my mind while thinking about this issue
> > > > > > > > further ... Once we support multiple auditd instances, including the
> > > > > > > > necessary record routing and duplication/multiple-sends (the host
> > > > > > > > always sees *everything*), we will likely need to find a way to "trim"
> > > > > > > > the audit container ID (ACID) lists we send in the records.  The
> > > > > > > > auditd instance running on the host/initns will always see everything,
> > > > > > > > so it will want the full container ACID list; however an auditd
> > > > > > > > instance running inside a container really should only see the ACIDs
> > > > > > > > of any child containers.
> > > > > > >
> > > > > > > Agreed.  This should be easy to check and limit, preventing an auditd
> > > > > > > from seeing any contid that is a parent of its own contid.
> > > > > > >
> > > > > > > > For example, imagine a system where the host has containers 1 and 2,
> > > > > > > > each running an auditd instance.  Inside container 1 there are
> > > > > > > > containers A and B.  Inside container 2 there are containers Y and Z.
> > > > > > > > If an audit event is generated in container Z, I would expect the
> > > > > > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd
> > > > > > > > should only see an ACID list of "Z".  The auditd running in container
> > > > > > > > 2 should not see the record at all (that will be relatively
> > > > > > > > straightforward).  Does that make sense?  Do we have the record
> > > > > > > > formats properly designed to handle this without too much problem (I'm
> > > > > > > > not entirely sure we do)?
> > > > > > >
> > > > > > > I completely agree and I believe we have record formats that are able to
> > > > > > > handle this already.
> > > > > >
> > > > > > I'm not convinced we do.  What about the cases where we have a field
> > > > > > with a list of audit container IDs?  How do we handle that?
> > > > >
> > > > > I don't understand the problem.  (I think you crossed your 1/2 vs
> > > > > A/B/Y/Z in your example.) ...
> > > >
> > > > It looks like I did, sorry about that.
> > > >
> > > > > ... Clarifying the example above, if as you
> > > > > suggest an event happens in container Z, the hosts's auditd would report
> > > > >         Z,^2
> > > > > and the auditd in container 2 would report
> > > > >         Z,^2
> > > > > but if there were another auditd running in container Z it would report
> > > > >         Z
> > > > > while the auditd in container 1 or A/B would see nothing.
> > > >
> > > > Yes.  My concern is how do we handle this to minimize duplicating and
> > > > rewriting the records?  It isn't so much about the format, although
> > > > the format is a side effect.
> > >
> > > Are you talking about caching, or about divulging more information than
> > > necessary or even information leaks?  Or even noticing that records that
> > > need to be generated to two audit daemons share the same contid field
> > > values and should be generated at the same time or information shared
> > > between them?  I'd see any of these as optimizations that don't affect
> > > the api.
> >
> > Imagine a record is generated in a container which has more than one
> > auditd in it's ancestry that should receive this record, how do we
> > handle that without completely killing performance?  That's my
> > concern.  If you've already thought up a plan for this - excellent,
> > please share :)
>
> No, I haven't given that much thought other than the correctness and
> security issues of making sure that each audit daemon is sufficiently
> isolated to do its job but not jeopardize another audit domain.  Audit
> already kills performance, according to some...
>
> We currently won't have that problem since there can only be one so far.
> Fixing and optimizing this is part of the next phase of the challenge of
> adding a second audit daemon.
>
> Let's work on correctness and reasonable efficiency for this phase and
> not focus on a problem we don't yet have.  I wouldn't consider this
> incurring technical debt at this point.

I agree, one stage at a time, but the choice we make here is going to
have a significant impact on what we can do later.  We need to get
this as "right" as possible; this isn't something we should dismiss
with a hand-wave as a problem for the next stage.  We don't need an
implementation, but I would like to see a rough design of how we would
address this problem.

> I could see cacheing a contid string from one starting point, but it may
> be more work to search that cached string to truncate it or add to it
> when another audit daemon requests a copy of a similar string.  I
> suppose every full contid string could be generated the first time it is
> used and parts of it used (start/finish) as needed but that
> search/indexing may not be worth it.

I hope we can do better than string manipulations in the kernel.  I'd
much rather defer generating the ACID list (if possible), than
generating a list only to keep copying and editing it as the record is
sent.
Richard Guy Briggs March 19, 2020, 9:47 p.m. UTC | #27
On 2020-03-18 17:47, Paul Moore wrote:
> On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-03-18 17:01, Paul Moore wrote:
> > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2020-03-13 12:42, Paul Moore wrote:
> > >
> > > ...
> > >
> > > > > The thread has had a lot of starts/stops, so I may be repeating a
> > > > > previous suggestion, but one idea would be to still emit a "death
> > > > > record" when the final task in the audit container ID does die, but
> > > > > block the particular audit container ID from reuse until it the
> > > > > SIGNAL2 info has been reported.  This gives us the timely ACID death
> > > > > notification while still preventing confusion and ambiguity caused by
> > > > > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > > > > there is a small nit about the ACID being present in the SIGNAL2
> > > > > *after* its death, but I think that can be easily explained and
> > > > > understood by admins.
> > > >
> > > > Thinking quickly about possible technical solutions to this, maybe it
> > > > makes sense to have two counters on a contobj so that we know when the
> > > > last process in that container exits and can issue the death
> > > > certificate, but we still block reuse of it until all further references
> > > > to it have been resolved.  This will likely also make it possible to
> > > > report the full contid chain in SIGNAL2 records.  This will eliminate
> > > > some of the issues we are discussing with regards to passing a contobj
> > > > vs a contid to the audit_log_contid function, but won't eliminate them
> > > > all because there are still some contids that won't have an object
> > > > associated with them to make it impossible to look them up in the
> > > > contobj lists.
> > >
> > > I'm not sure you need a full second counter, I imagine a simple flag
> > > would be okay.  I think you just something to indicate that this ACID
> > > object is marked as "dead" but it still being held for sanity reasons
> > > and should not be reused.
> >
> > Ok, I see your point.  This refcount can be changed to a flag easily
> > enough without change to the api if we can be sure that more than one
> > signal can't be delivered to the audit daemon *and* collected by sig2.
> > I'll have a more careful look at the audit daemon code to see if I can
> > determine this.
> 
> Maybe I'm not understanding your concern, but this isn't really
> different than any of the other things we track for the auditd signal
> sender, right?  If we are worried about multiple signals being sent
> then it applies to everything, not just the audit container ID.

Yes, you are right.  In all other cases the information is simply
overwritten.  In the case of the audit container identifier any
previous value is put before a new one is referenced, so only the last
signal is kept.  So, we only need a flag.  Does a flag implemented with
a rcu-protected refcount sound reasonable to you?

> > Another question occurs to me is that what if the audit daemon is sent a
> > signal and it cannot or will not collect the sig2 information from the
> > kernel (SIGKILL?)?  Does that audit container identifier remain dead
> > until reboot, or do we institute some other form of reaping, possibly
> > time-based?
> 
> In order to preserve the integrity of the audit log that ACID value
> would need to remain unavailable until the ACID which contains the
> associated auditd is "dead" (no one can request the signal sender's
> info if that container is dead).

I don't understand why it would be associated with the contid of the
audit daemon process rather than with the audit daemon process itself.
How does the signal collection somehow get transferred or delegated to
another member of that audit daemon's container?

Thinking aloud here, the audit daemon's exit when it calls audit_free()
needs to ..._put_sig and cancel that audit_sig_cid (which in the future
will be allocated per auditd rather than the global it is now since
there is only one audit daemon).

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Richard Guy Briggs March 19, 2020, 10:02 p.m. UTC | #28
On 2020-03-18 18:06, Paul Moore wrote:
> On Wed, Mar 18, 2020 at 5:56 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-03-18 17:42, Paul Moore wrote:
> > > On Wed, Mar 18, 2020 at 5:27 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2020-03-18 16:56, Paul Moore wrote:
> > > > > On Fri, Mar 13, 2020 at 2:59 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > On 2020-03-13 12:29, Paul Moore wrote:
> > > > > > > On Thu, Mar 12, 2020 at 3:30 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > > > On 2020-02-13 16:44, Paul Moore wrote:
> > > > > > > > > This is a bit of a thread-hijack, and for that I apologize, but
> > > > > > > > > another thought crossed my mind while thinking about this issue
> > > > > > > > > further ... Once we support multiple auditd instances, including the
> > > > > > > > > necessary record routing and duplication/multiple-sends (the host
> > > > > > > > > always sees *everything*), we will likely need to find a way to "trim"
> > > > > > > > > the audit container ID (ACID) lists we send in the records.  The
> > > > > > > > > auditd instance running on the host/initns will always see everything,
> > > > > > > > > so it will want the full container ACID list; however an auditd
> > > > > > > > > instance running inside a container really should only see the ACIDs
> > > > > > > > > of any child containers.
> > > > > > > >
> > > > > > > > Agreed.  This should be easy to check and limit, preventing an auditd
> > > > > > > > from seeing any contid that is a parent of its own contid.
> > > > > > > >
> > > > > > > > > For example, imagine a system where the host has containers 1 and 2,
> > > > > > > > > each running an auditd instance.  Inside container 1 there are
> > > > > > > > > containers A and B.  Inside container 2 there are containers Y and Z.
> > > > > > > > > If an audit event is generated in container Z, I would expect the
> > > > > > > > > host's auditd to see a ACID list of "1,Z" but container 1's auditd
> > > > > > > > > should only see an ACID list of "Z".  The auditd running in container
> > > > > > > > > 2 should not see the record at all (that will be relatively
> > > > > > > > > straightforward).  Does that make sense?  Do we have the record
> > > > > > > > > formats properly designed to handle this without too much problem (I'm
> > > > > > > > > not entirely sure we do)?
> > > > > > > >
> > > > > > > > I completely agree and I believe we have record formats that are able to
> > > > > > > > handle this already.
> > > > > > >
> > > > > > > I'm not convinced we do.  What about the cases where we have a field
> > > > > > > with a list of audit container IDs?  How do we handle that?
> > > > > >
> > > > > > I don't understand the problem.  (I think you crossed your 1/2 vs
> > > > > > A/B/Y/Z in your example.) ...
> > > > >
> > > > > It looks like I did, sorry about that.
> > > > >
> > > > > > ... Clarifying the example above, if as you
> > > > > > suggest an event happens in container Z, the hosts's auditd would report
> > > > > >         Z,^2
> > > > > > and the auditd in container 2 would report
> > > > > >         Z,^2
> > > > > > but if there were another auditd running in container Z it would report
> > > > > >         Z
> > > > > > while the auditd in container 1 or A/B would see nothing.
> > > > >
> > > > > Yes.  My concern is how do we handle this to minimize duplicating and
> > > > > rewriting the records?  It isn't so much about the format, although
> > > > > the format is a side effect.
> > > >
> > > > Are you talking about caching, or about divulging more information than
> > > > necessary or even information leaks?  Or even noticing that records that
> > > > need to be generated to two audit daemons share the same contid field
> > > > values and should be generated at the same time or information shared
> > > > between them?  I'd see any of these as optimizations that don't affect
> > > > the api.
> > >
> > > Imagine a record is generated in a container which has more than one
> > > auditd in it's ancestry that should receive this record, how do we
> > > handle that without completely killing performance?  That's my
> > > concern.  If you've already thought up a plan for this - excellent,
> > > please share :)
> >
> > No, I haven't given that much thought other than the correctness and
> > security issues of making sure that each audit daemon is sufficiently
> > isolated to do its job but not jeopardize another audit domain.  Audit
> > already kills performance, according to some...
> >
> > We currently won't have that problem since there can only be one so far.
> > Fixing and optimizing this is part of the next phase of the challenge of
> > adding a second audit daemon.
> >
> > Let's work on correctness and reasonable efficiency for this phase and
> > not focus on a problem we don't yet have.  I wouldn't consider this
> > incurring technical debt at this point.
> 
> I agree, one stage at a time, but the choice we make here is going to
> have a significant impact on what we can do later.  We need to get
> this as "right" as possible; this isn't something we should dismiss
> with a hand-wave as a problem for the next stage.  We don't need an
> implementation, but I would like to see a rough design of how we would
> address this problem.
> 
> > I could see cacheing a contid string from one starting point, but it may
> > be more work to search that cached string to truncate it or add to it
> > when another audit daemon requests a copy of a similar string.  I
> > suppose every full contid string could be generated the first time it is
> > used and parts of it used (start/finish) as needed but that
> > search/indexing may not be worth it.
> 
> I hope we can do better than string manipulations in the kernel.  I'd
> much rather defer generating the ACID list (if possible), than
> generating a list only to keep copying and editing it as the record is
> sent.

At the moment we are stuck with a string-only format.  The contid list
only exists in the kernel.  When do you suggest generating the contid
list?  It sounds like you are hinting at userspace generating that list
from multiple records over the span of audit logs since boot of the
machine.

Even if we had a binary format, the current design would require
generating that list at the time of record generation since it could be
any contiguous subset of a full nested contid list.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore March 20, 2020, 9:56 p.m. UTC | #29
On Thu, Mar 19, 2020 at 5:48 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-03-18 17:47, Paul Moore wrote:
> > On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2020-03-18 17:01, Paul Moore wrote:
> > > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > On 2020-03-13 12:42, Paul Moore wrote:
> > > >
> > > > ...
> > > >
> > > > > > The thread has had a lot of starts/stops, so I may be repeating a
> > > > > > previous suggestion, but one idea would be to still emit a "death
> > > > > > record" when the final task in the audit container ID does die, but
> > > > > > block the particular audit container ID from reuse until it the
> > > > > > SIGNAL2 info has been reported.  This gives us the timely ACID death
> > > > > > notification while still preventing confusion and ambiguity caused by
> > > > > > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > > > > > there is a small nit about the ACID being present in the SIGNAL2
> > > > > > *after* its death, but I think that can be easily explained and
> > > > > > understood by admins.
> > > > >
> > > > > Thinking quickly about possible technical solutions to this, maybe it
> > > > > makes sense to have two counters on a contobj so that we know when the
> > > > > last process in that container exits and can issue the death
> > > > > certificate, but we still block reuse of it until all further references
> > > > > to it have been resolved.  This will likely also make it possible to
> > > > > report the full contid chain in SIGNAL2 records.  This will eliminate
> > > > > some of the issues we are discussing with regards to passing a contobj
> > > > > vs a contid to the audit_log_contid function, but won't eliminate them
> > > > > all because there are still some contids that won't have an object
> > > > > associated with them to make it impossible to look them up in the
> > > > > contobj lists.
> > > >
> > > > I'm not sure you need a full second counter, I imagine a simple flag
> > > > would be okay.  I think you just something to indicate that this ACID
> > > > object is marked as "dead" but it still being held for sanity reasons
> > > > and should not be reused.
> > >
> > > Ok, I see your point.  This refcount can be changed to a flag easily
> > > enough without change to the api if we can be sure that more than one
> > > signal can't be delivered to the audit daemon *and* collected by sig2.
> > > I'll have a more careful look at the audit daemon code to see if I can
> > > determine this.
> >
> > Maybe I'm not understanding your concern, but this isn't really
> > different than any of the other things we track for the auditd signal
> > sender, right?  If we are worried about multiple signals being sent
> > then it applies to everything, not just the audit container ID.
>
> Yes, you are right.  In all other cases the information is simply
> overwritten.  In the case of the audit container identifier any
> previous value is put before a new one is referenced, so only the last
> signal is kept.  So, we only need a flag.  Does a flag implemented with
> a rcu-protected refcount sound reasonable to you?

Well, if I recall correctly you still need to fix the locking in this
patchset so until we see what that looks like it is hard to say for
certain.  Just make sure that the flag is somehow protected from
races; it is probably a lot like the "valid" flags you sometimes see
with RCU protected lists.

> > > Another question occurs to me is that what if the audit daemon is sent a
> > > signal and it cannot or will not collect the sig2 information from the
> > > kernel (SIGKILL?)?  Does that audit container identifier remain dead
> > > until reboot, or do we institute some other form of reaping, possibly
> > > time-based?
> >
> > In order to preserve the integrity of the audit log that ACID value
> > would need to remain unavailable until the ACID which contains the
> > associated auditd is "dead" (no one can request the signal sender's
> > info if that container is dead).
>
> I don't understand why it would be associated with the contid of the
> audit daemon process rather than with the audit daemon process itself.
> How does the signal collection somehow get transferred or delegated to
> another member of that audit daemon's container?

Presumably once we support multiple audit daemons we will need a
struct to contain the associated connection state, with at most one
struct (and one auditd) allowed for a given ACID.  I would expect that
the signal sender info would be part of that state included in that
struct.  If a task sent a signal to it's associated auditd, and no one
ever queried the signal information stored in the per-ACID state
struct, I would expect that the refcount/flag/whatever would remain
held for the signal sender's ACID until the auditd state's ACID died
(the struct would be reaped as part of the ACID death).  In cases
where the container orchestrator blocks sending signals across ACID
boundaries this really isn't an issue as it will all be the same ACID,
but since we don't want to impose any restrictions on what a container
*could* be it is important to make sure we handle the case where the
signal sender's ACID may be different from the associated auditd's
ACID.

> Thinking aloud here, the audit daemon's exit when it calls audit_free()
> needs to ..._put_sig and cancel that audit_sig_cid (which in the future
> will be allocated per auditd rather than the global it is now since
> there is only one audit daemon).
>
> > paul moore
>
> - RGB
>
> --
> Richard Guy Briggs <rgb@redhat.com>
> Sr. S/W Engineer, Kernel Security, Base Operating Systems
> Remote, Ottawa, Red Hat Canada
> IRC: rgb, SunRaycer
> Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore March 24, 2020, 12:16 a.m. UTC | #30
On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-03-18 18:06, Paul Moore wrote:

...

> > I hope we can do better than string manipulations in the kernel.  I'd
> > much rather defer generating the ACID list (if possible), than
> > generating a list only to keep copying and editing it as the record is
> > sent.
>
> At the moment we are stuck with a string-only format.

Yes, we are.  That is another topic, and another set of changes I've
been deferring so as to not disrupt the audit container ID work.

I was thinking of what we do inside the kernel between when the record
triggering event happens and when we actually emit the record to
userspace.  Perhaps we collect the ACID information while the event is
occurring, but we defer generating the record until later when we have
a better understanding of what should be included in the ACID list.
It is somewhat similar (but obviously different) to what we do for
PATH records (we collect the pathname info when the path is being
resolved).
Richard Guy Briggs March 24, 2020, 9:01 p.m. UTC | #31
On 2020-03-23 20:16, Paul Moore wrote:
> On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-03-18 18:06, Paul Moore wrote:
> 
> ...
> 
> > > I hope we can do better than string manipulations in the kernel.  I'd
> > > much rather defer generating the ACID list (if possible), than
> > > generating a list only to keep copying and editing it as the record is
> > > sent.
> >
> > At the moment we are stuck with a string-only format.
> 
> Yes, we are.  That is another topic, and another set of changes I've
> been deferring so as to not disrupt the audit container ID work.
> 
> I was thinking of what we do inside the kernel between when the record
> triggering event happens and when we actually emit the record to
> userspace.  Perhaps we collect the ACID information while the event is
> occurring, but we defer generating the record until later when we have
> a better understanding of what should be included in the ACID list.
> It is somewhat similar (but obviously different) to what we do for
> PATH records (we collect the pathname info when the path is being
> resolved).

Ok, now I understand your concern.

In the case of NETFILTER_PKT records, the CONTAINER_ID record is the
only other possible record and they are generated at the same time with
a local context.

In the case of any event involving a syscall, that CONTAINER_ID record
is generated at the time of the rest of the event record generation at
syscall exit.

The others are only generated when needed, such as the sig2 reply.

We generally just store the contobj pointer until we actually generate
the CONTAINER_ID (or CONTAINER_OP) record.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Richard Guy Briggs March 25, 2020, 12:29 p.m. UTC | #32
On 2020-03-20 17:56, Paul Moore wrote:
> On Thu, Mar 19, 2020 at 5:48 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-03-18 17:47, Paul Moore wrote:
> > > On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2020-03-18 17:01, Paul Moore wrote:
> > > > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > On 2020-03-13 12:42, Paul Moore wrote:
> > > > >
> > > > > ...
> > > > >
> > > > > > > The thread has had a lot of starts/stops, so I may be repeating a
> > > > > > > previous suggestion, but one idea would be to still emit a "death
> > > > > > > record" when the final task in the audit container ID does die, but
> > > > > > > block the particular audit container ID from reuse until it the
> > > > > > > SIGNAL2 info has been reported.  This gives us the timely ACID death
> > > > > > > notification while still preventing confusion and ambiguity caused by
> > > > > > > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > > > > > > there is a small nit about the ACID being present in the SIGNAL2
> > > > > > > *after* its death, but I think that can be easily explained and
> > > > > > > understood by admins.
> > > > > >
> > > > > > Thinking quickly about possible technical solutions to this, maybe it
> > > > > > makes sense to have two counters on a contobj so that we know when the
> > > > > > last process in that container exits and can issue the death
> > > > > > certificate, but we still block reuse of it until all further references
> > > > > > to it have been resolved.  This will likely also make it possible to
> > > > > > report the full contid chain in SIGNAL2 records.  This will eliminate
> > > > > > some of the issues we are discussing with regards to passing a contobj
> > > > > > vs a contid to the audit_log_contid function, but won't eliminate them
> > > > > > all because there are still some contids that won't have an object
> > > > > > associated with them to make it impossible to look them up in the
> > > > > > contobj lists.
> > > > >
> > > > > I'm not sure you need a full second counter, I imagine a simple flag
> > > > > would be okay.  I think you just something to indicate that this ACID
> > > > > object is marked as "dead" but it still being held for sanity reasons
> > > > > and should not be reused.
> > > >
> > > > Ok, I see your point.  This refcount can be changed to a flag easily
> > > > enough without change to the api if we can be sure that more than one
> > > > signal can't be delivered to the audit daemon *and* collected by sig2.
> > > > I'll have a more careful look at the audit daemon code to see if I can
> > > > determine this.
> > >
> > > Maybe I'm not understanding your concern, but this isn't really
> > > different than any of the other things we track for the auditd signal
> > > sender, right?  If we are worried about multiple signals being sent
> > > then it applies to everything, not just the audit container ID.
> >
> > Yes, you are right.  In all other cases the information is simply
> > overwritten.  In the case of the audit container identifier any
> > previous value is put before a new one is referenced, so only the last
> > signal is kept.  So, we only need a flag.  Does a flag implemented with
> > a rcu-protected refcount sound reasonable to you?
> 
> Well, if I recall correctly you still need to fix the locking in this
> patchset so until we see what that looks like it is hard to say for
> certain.  Just make sure that the flag is somehow protected from
> races; it is probably a lot like the "valid" flags you sometimes see
> with RCU protected lists.

This is like looking for a needle in a haystack.  Can you point me to
some code that does "valid" flags with RCU protected lists.

> > > > Another question occurs to me is that what if the audit daemon is sent a
> > > > signal and it cannot or will not collect the sig2 information from the
> > > > kernel (SIGKILL?)?  Does that audit container identifier remain dead
> > > > until reboot, or do we institute some other form of reaping, possibly
> > > > time-based?
> > >
> > > In order to preserve the integrity of the audit log that ACID value
> > > would need to remain unavailable until the ACID which contains the
> > > associated auditd is "dead" (no one can request the signal sender's
> > > info if that container is dead).
> >
> > I don't understand why it would be associated with the contid of the
> > audit daemon process rather than with the audit daemon process itself.
> > How does the signal collection somehow get transferred or delegated to
> > another member of that audit daemon's container?
> 
> Presumably once we support multiple audit daemons we will need a
> struct to contain the associated connection state, with at most one
> struct (and one auditd) allowed for a given ACID.  I would expect that
> the signal sender info would be part of that state included in that
> struct.  If a task sent a signal to it's associated auditd, and no one
> ever queried the signal information stored in the per-ACID state
> struct, I would expect that the refcount/flag/whatever would remain
> held for the signal sender's ACID until the auditd state's ACID died
> (the struct would be reaped as part of the ACID death).  In cases
> where the container orchestrator blocks sending signals across ACID
> boundaries this really isn't an issue as it will all be the same ACID,
> but since we don't want to impose any restrictions on what a container
> *could* be it is important to make sure we handle the case where the
> signal sender's ACID may be different from the associated auditd's
> ACID.
> 
> > Thinking aloud here, the audit daemon's exit when it calls audit_free()
> > needs to ..._put_sig and cancel that audit_sig_cid (which in the future
> > will be allocated per auditd rather than the global it is now since
> > there is only one audit daemon).
> >
> > > paul moore
> >
> > - RGB
> 
> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore March 29, 2020, 3:11 a.m. UTC | #33
On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-03-23 20:16, Paul Moore wrote:
> > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2020-03-18 18:06, Paul Moore wrote:
> >
> > ...
> >
> > > > I hope we can do better than string manipulations in the kernel.  I'd
> > > > much rather defer generating the ACID list (if possible), than
> > > > generating a list only to keep copying and editing it as the record is
> > > > sent.
> > >
> > > At the moment we are stuck with a string-only format.
> >
> > Yes, we are.  That is another topic, and another set of changes I've
> > been deferring so as to not disrupt the audit container ID work.
> >
> > I was thinking of what we do inside the kernel between when the record
> > triggering event happens and when we actually emit the record to
> > userspace.  Perhaps we collect the ACID information while the event is
> > occurring, but we defer generating the record until later when we have
> > a better understanding of what should be included in the ACID list.
> > It is somewhat similar (but obviously different) to what we do for
> > PATH records (we collect the pathname info when the path is being
> > resolved).
>
> Ok, now I understand your concern.
>
> In the case of NETFILTER_PKT records, the CONTAINER_ID record is the
> only other possible record and they are generated at the same time with
> a local context.
>
> In the case of any event involving a syscall, that CONTAINER_ID record
> is generated at the time of the rest of the event record generation at
> syscall exit.
>
> The others are only generated when needed, such as the sig2 reply.
>
> We generally just store the contobj pointer until we actually generate
> the CONTAINER_ID (or CONTAINER_OP) record.

Perhaps I'm remembering your latest spin of these patches incorrectly,
but there is still a big gap between when the record is generated and
when it is sent up to the audit daemon.  Most importantly in that gap
is the whole big queue/multicast/unicast mess.

You don't need to show me code, but I would like to see some sort of
plan for dealing with multiple nested audit daemons.  Basically I just
want to make sure we aren't painting ourselves into a corner with this
approach; and if for some horrible reason we are, I at least want us
to be aware of what we are getting ourselves into.
Paul Moore March 29, 2020, 3:17 a.m. UTC | #34
On Wed, Mar 25, 2020 at 8:29 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-03-20 17:56, Paul Moore wrote:
> > On Thu, Mar 19, 2020 at 5:48 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2020-03-18 17:47, Paul Moore wrote:
> > > > On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > On 2020-03-18 17:01, Paul Moore wrote:
> > > > > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > > On 2020-03-13 12:42, Paul Moore wrote:
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > > > The thread has had a lot of starts/stops, so I may be repeating a
> > > > > > > > previous suggestion, but one idea would be to still emit a "death
> > > > > > > > record" when the final task in the audit container ID does die, but
> > > > > > > > block the particular audit container ID from reuse until it the
> > > > > > > > SIGNAL2 info has been reported.  This gives us the timely ACID death
> > > > > > > > notification while still preventing confusion and ambiguity caused by
> > > > > > > > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > > > > > > > there is a small nit about the ACID being present in the SIGNAL2
> > > > > > > > *after* its death, but I think that can be easily explained and
> > > > > > > > understood by admins.
> > > > > > >
> > > > > > > Thinking quickly about possible technical solutions to this, maybe it
> > > > > > > makes sense to have two counters on a contobj so that we know when the
> > > > > > > last process in that container exits and can issue the death
> > > > > > > certificate, but we still block reuse of it until all further references
> > > > > > > to it have been resolved.  This will likely also make it possible to
> > > > > > > report the full contid chain in SIGNAL2 records.  This will eliminate
> > > > > > > some of the issues we are discussing with regards to passing a contobj
> > > > > > > vs a contid to the audit_log_contid function, but won't eliminate them
> > > > > > > all because there are still some contids that won't have an object
> > > > > > > associated with them to make it impossible to look them up in the
> > > > > > > contobj lists.
> > > > > >
> > > > > > I'm not sure you need a full second counter, I imagine a simple flag
> > > > > > would be okay.  I think you just something to indicate that this ACID
> > > > > > object is marked as "dead" but it still being held for sanity reasons
> > > > > > and should not be reused.
> > > > >
> > > > > Ok, I see your point.  This refcount can be changed to a flag easily
> > > > > enough without change to the api if we can be sure that more than one
> > > > > signal can't be delivered to the audit daemon *and* collected by sig2.
> > > > > I'll have a more careful look at the audit daemon code to see if I can
> > > > > determine this.
> > > >
> > > > Maybe I'm not understanding your concern, but this isn't really
> > > > different than any of the other things we track for the auditd signal
> > > > sender, right?  If we are worried about multiple signals being sent
> > > > then it applies to everything, not just the audit container ID.
> > >
> > > Yes, you are right.  In all other cases the information is simply
> > > overwritten.  In the case of the audit container identifier any
> > > previous value is put before a new one is referenced, so only the last
> > > signal is kept.  So, we only need a flag.  Does a flag implemented with
> > > a rcu-protected refcount sound reasonable to you?
> >
> > Well, if I recall correctly you still need to fix the locking in this
> > patchset so until we see what that looks like it is hard to say for
> > certain.  Just make sure that the flag is somehow protected from
> > races; it is probably a lot like the "valid" flags you sometimes see
> > with RCU protected lists.
>
> This is like looking for a needle in a haystack.  Can you point me to
> some code that does "valid" flags with RCU protected lists.

Sigh.  Come on Richard, you've been playing in the kernel for some
time now.  I can't think of one off the top of my head as I write
this, but there are several resources that deal with RCU protected
lists in the kernel, Google is your friend and Documentation/RCU is
your friend.

Spending time to learn how RCU works and how to use it properly is not
time wasted.  It's a tricky thing to get right (I have to refresh my
memory on some of the more subtle details each time I write/review RCU
code), but it's very cool when done correctly.
Richard Guy Briggs March 30, 2020, 1:47 p.m. UTC | #35
On 2020-03-28 23:11, Paul Moore wrote:
> On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-03-23 20:16, Paul Moore wrote:
> > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2020-03-18 18:06, Paul Moore wrote:
> > >
> > > ...
> > >
> > > > > I hope we can do better than string manipulations in the kernel.  I'd
> > > > > much rather defer generating the ACID list (if possible), than
> > > > > generating a list only to keep copying and editing it as the record is
> > > > > sent.
> > > >
> > > > At the moment we are stuck with a string-only format.
> > >
> > > Yes, we are.  That is another topic, and another set of changes I've
> > > been deferring so as to not disrupt the audit container ID work.
> > >
> > > I was thinking of what we do inside the kernel between when the record
> > > triggering event happens and when we actually emit the record to
> > > userspace.  Perhaps we collect the ACID information while the event is
> > > occurring, but we defer generating the record until later when we have
> > > a better understanding of what should be included in the ACID list.
> > > It is somewhat similar (but obviously different) to what we do for
> > > PATH records (we collect the pathname info when the path is being
> > > resolved).
> >
> > Ok, now I understand your concern.
> >
> > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the
> > only other possible record and they are generated at the same time with
> > a local context.
> >
> > In the case of any event involving a syscall, that CONTAINER_ID record
> > is generated at the time of the rest of the event record generation at
> > syscall exit.
> >
> > The others are only generated when needed, such as the sig2 reply.
> >
> > We generally just store the contobj pointer until we actually generate
> > the CONTAINER_ID (or CONTAINER_OP) record.
> 
> Perhaps I'm remembering your latest spin of these patches incorrectly,
> but there is still a big gap between when the record is generated and
> when it is sent up to the audit daemon.  Most importantly in that gap
> is the whole big queue/multicast/unicast mess.

So you suggest generating that record on the fly once it reaches the end
of the audit_queue just before being sent?  That sounds...  disruptive.
Each audit daemon is going to have its own queues, so by the time it
ends up in a particular queue, we'll already know its scope and would
have the right list of contids to print in that record.

I don't see the point in deferring the generation of the contid list
beyond the point of submitting that record to the relevant audit_queue.

> You don't need to show me code, but I would like to see some sort of
> plan for dealing with multiple nested audit daemons.  Basically I just
> want to make sure we aren't painting ourselves into a corner with this
> approach; and if for some horrible reason we are, I at least want us
> to be aware of what we are getting ourselves into.

It wouldn't be significantly different from what we have, but as would
have to happen for *all* records generated to a particular auditd/queue
it would have to take the scope of that auditd into account, getting
references to PIDs right for that PID namespace, along with other
similar scope views including contid list range.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore March 30, 2020, 2:26 p.m. UTC | #36
On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-03-28 23:11, Paul Moore wrote:
> > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2020-03-23 20:16, Paul Moore wrote:
> > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > On 2020-03-18 18:06, Paul Moore wrote:
> > > >
> > > > ...
> > > >
> > > > > > I hope we can do better than string manipulations in the kernel.  I'd
> > > > > > much rather defer generating the ACID list (if possible), than
> > > > > > generating a list only to keep copying and editing it as the record is
> > > > > > sent.
> > > > >
> > > > > At the moment we are stuck with a string-only format.
> > > >
> > > > Yes, we are.  That is another topic, and another set of changes I've
> > > > been deferring so as to not disrupt the audit container ID work.
> > > >
> > > > I was thinking of what we do inside the kernel between when the record
> > > > triggering event happens and when we actually emit the record to
> > > > userspace.  Perhaps we collect the ACID information while the event is
> > > > occurring, but we defer generating the record until later when we have
> > > > a better understanding of what should be included in the ACID list.
> > > > It is somewhat similar (but obviously different) to what we do for
> > > > PATH records (we collect the pathname info when the path is being
> > > > resolved).
> > >
> > > Ok, now I understand your concern.
> > >
> > > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the
> > > only other possible record and they are generated at the same time with
> > > a local context.
> > >
> > > In the case of any event involving a syscall, that CONTAINER_ID record
> > > is generated at the time of the rest of the event record generation at
> > > syscall exit.
> > >
> > > The others are only generated when needed, such as the sig2 reply.
> > >
> > > We generally just store the contobj pointer until we actually generate
> > > the CONTAINER_ID (or CONTAINER_OP) record.
> >
> > Perhaps I'm remembering your latest spin of these patches incorrectly,
> > but there is still a big gap between when the record is generated and
> > when it is sent up to the audit daemon.  Most importantly in that gap
> > is the whole big queue/multicast/unicast mess.
>
> So you suggest generating that record on the fly once it reaches the end
> of the audit_queue just before being sent?  That sounds...  disruptive.
> Each audit daemon is going to have its own queues, so by the time it
> ends up in a particular queue, we'll already know its scope and would
> have the right list of contids to print in that record.

I'm not suggesting any particular solution, I'm just pointing out a
potential problem.  It isn't clear to me that you've thought about how
we generate a multiple records, each with the correct ACID list
intended for a specific audit daemon, based on a single audit event.
Explain to me how you intend that to work and we are good.  Be
specific because I'm not convinced we are talking on the same plane
here.
Richard Guy Briggs March 30, 2020, 3:23 p.m. UTC | #37
On 2020-03-28 23:17, Paul Moore wrote:
> On Wed, Mar 25, 2020 at 8:29 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-03-20 17:56, Paul Moore wrote:
> > > On Thu, Mar 19, 2020 at 5:48 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2020-03-18 17:47, Paul Moore wrote:
> > > > > On Wed, Mar 18, 2020 at 5:42 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > On 2020-03-18 17:01, Paul Moore wrote:
> > > > > > > On Fri, Mar 13, 2020 at 3:23 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > > > On 2020-03-13 12:42, Paul Moore wrote:
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > > > > The thread has had a lot of starts/stops, so I may be repeating a
> > > > > > > > > previous suggestion, but one idea would be to still emit a "death
> > > > > > > > > record" when the final task in the audit container ID does die, but
> > > > > > > > > block the particular audit container ID from reuse until it the
> > > > > > > > > SIGNAL2 info has been reported.  This gives us the timely ACID death
> > > > > > > > > notification while still preventing confusion and ambiguity caused by
> > > > > > > > > potentially reusing the ACID before the SIGNAL2 record has been sent;
> > > > > > > > > there is a small nit about the ACID being present in the SIGNAL2
> > > > > > > > > *after* its death, but I think that can be easily explained and
> > > > > > > > > understood by admins.
> > > > > > > >
> > > > > > > > Thinking quickly about possible technical solutions to this, maybe it
> > > > > > > > makes sense to have two counters on a contobj so that we know when the
> > > > > > > > last process in that container exits and can issue the death
> > > > > > > > certificate, but we still block reuse of it until all further references
> > > > > > > > to it have been resolved.  This will likely also make it possible to
> > > > > > > > report the full contid chain in SIGNAL2 records.  This will eliminate
> > > > > > > > some of the issues we are discussing with regards to passing a contobj
> > > > > > > > vs a contid to the audit_log_contid function, but won't eliminate them
> > > > > > > > all because there are still some contids that won't have an object
> > > > > > > > associated with them to make it impossible to look them up in the
> > > > > > > > contobj lists.
> > > > > > >
> > > > > > > I'm not sure you need a full second counter, I imagine a simple flag
> > > > > > > would be okay.  I think you just something to indicate that this ACID
> > > > > > > object is marked as "dead" but it still being held for sanity reasons
> > > > > > > and should not be reused.
> > > > > >
> > > > > > Ok, I see your point.  This refcount can be changed to a flag easily
> > > > > > enough without change to the api if we can be sure that more than one
> > > > > > signal can't be delivered to the audit daemon *and* collected by sig2.
> > > > > > I'll have a more careful look at the audit daemon code to see if I can
> > > > > > determine this.
> > > > >
> > > > > Maybe I'm not understanding your concern, but this isn't really
> > > > > different than any of the other things we track for the auditd signal
> > > > > sender, right?  If we are worried about multiple signals being sent
> > > > > then it applies to everything, not just the audit container ID.
> > > >
> > > > Yes, you are right.  In all other cases the information is simply
> > > > overwritten.  In the case of the audit container identifier any
> > > > previous value is put before a new one is referenced, so only the last
> > > > signal is kept.  So, we only need a flag.  Does a flag implemented with
> > > > a rcu-protected refcount sound reasonable to you?
> > >
> > > Well, if I recall correctly you still need to fix the locking in this
> > > patchset so until we see what that looks like it is hard to say for
> > > certain.  Just make sure that the flag is somehow protected from
> > > races; it is probably a lot like the "valid" flags you sometimes see
> > > with RCU protected lists.
> >
> > This is like looking for a needle in a haystack.  Can you point me to
> > some code that does "valid" flags with RCU protected lists.
> 
> Sigh.  Come on Richard, you've been playing in the kernel for some
> time now.  I can't think of one off the top of my head as I write
> this, but there are several resources that deal with RCU protected
> lists in the kernel, Google is your friend and Documentation/RCU is
> your friend.

Ok, I thought you were talking about a specific piece of code...

> Spending time to learn how RCU works and how to use it properly is not
> time wasted.  It's a tricky thing to get right (I have to refresh my
> memory on some of the more subtle details each time I write/review RCU
> code), but it's very cool when done correctly.

I review Documentation/RCU almost every time I work on RCU...

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Richard Guy Briggs March 30, 2020, 4:21 p.m. UTC | #38
On 2020-03-30 10:26, Paul Moore wrote:
> On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-03-28 23:11, Paul Moore wrote:
> > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2020-03-23 20:16, Paul Moore wrote:
> > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> > > > >
> > > > > ...
> > > > >
> > > > > > > I hope we can do better than string manipulations in the kernel.  I'd
> > > > > > > much rather defer generating the ACID list (if possible), than
> > > > > > > generating a list only to keep copying and editing it as the record is
> > > > > > > sent.
> > > > > >
> > > > > > At the moment we are stuck with a string-only format.
> > > > >
> > > > > Yes, we are.  That is another topic, and another set of changes I've
> > > > > been deferring so as to not disrupt the audit container ID work.
> > > > >
> > > > > I was thinking of what we do inside the kernel between when the record
> > > > > triggering event happens and when we actually emit the record to
> > > > > userspace.  Perhaps we collect the ACID information while the event is
> > > > > occurring, but we defer generating the record until later when we have
> > > > > a better understanding of what should be included in the ACID list.
> > > > > It is somewhat similar (but obviously different) to what we do for
> > > > > PATH records (we collect the pathname info when the path is being
> > > > > resolved).
> > > >
> > > > Ok, now I understand your concern.
> > > >
> > > > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the
> > > > only other possible record and they are generated at the same time with
> > > > a local context.
> > > >
> > > > In the case of any event involving a syscall, that CONTAINER_ID record
> > > > is generated at the time of the rest of the event record generation at
> > > > syscall exit.
> > > >
> > > > The others are only generated when needed, such as the sig2 reply.
> > > >
> > > > We generally just store the contobj pointer until we actually generate
> > > > the CONTAINER_ID (or CONTAINER_OP) record.
> > >
> > > Perhaps I'm remembering your latest spin of these patches incorrectly,
> > > but there is still a big gap between when the record is generated and
> > > when it is sent up to the audit daemon.  Most importantly in that gap
> > > is the whole big queue/multicast/unicast mess.
> >
> > So you suggest generating that record on the fly once it reaches the end
> > of the audit_queue just before being sent?  That sounds...  disruptive.
> > Each audit daemon is going to have its own queues, so by the time it
> > ends up in a particular queue, we'll already know its scope and would
> > have the right list of contids to print in that record.
> 
> I'm not suggesting any particular solution, I'm just pointing out a
> potential problem.  It isn't clear to me that you've thought about how
> we generate a multiple records, each with the correct ACID list
> intended for a specific audit daemon, based on a single audit event.
> Explain to me how you intend that to work and we are good.  Be
> specific because I'm not convinced we are talking on the same plane
> here.

Well, every time a record gets generated, *any* record gets generated,
we'll need to check for which audit daemons this record is in scope and
generate a different one for each depending on the content and whether
or not the content is influenced by the scope.  Some events will be
generated for some of the auditd/queues and not for others.  Some fields
in some of the records will need to be tailored for that specific
auditd/queue for either contid scope or PID namespace base reference or
other scope differences.

Every auditd/queue will need its own serial number per event and maybe
even timestamp depending on whether that auditd is in a different time
namespace and beyond that PID and contid fields and maybe others will
need to be customized per auditd/queue.  So, it may make sense to
generate the contents of each field for a generic record and then either
reuse content that is unchanged or generate new content for a field that
will be different in a different auditd/queue scope, then render the
final record per auditd/queue and enqueue it.

I see this as the primary work of ghak93 ("RFE: run multiple audit
daemons on one machine").  I don't see how our proposed contid field
value format changes with this path above.

This is getting closer and closer to a netlink binary format too...

This is also an argument for spreading fields out over more record types
rather than cramming as much information as we can into one record type
(subject attributes in particular).

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore March 30, 2020, 5:34 p.m. UTC | #39
On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-03-30 10:26, Paul Moore wrote:
> > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2020-03-28 23:11, Paul Moore wrote:
> > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > On 2020-03-23 20:16, Paul Moore wrote:
> > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > > > I hope we can do better than string manipulations in the kernel.  I'd
> > > > > > > > much rather defer generating the ACID list (if possible), than
> > > > > > > > generating a list only to keep copying and editing it as the record is
> > > > > > > > sent.
> > > > > > >
> > > > > > > At the moment we are stuck with a string-only format.
> > > > > >
> > > > > > Yes, we are.  That is another topic, and another set of changes I've
> > > > > > been deferring so as to not disrupt the audit container ID work.
> > > > > >
> > > > > > I was thinking of what we do inside the kernel between when the record
> > > > > > triggering event happens and when we actually emit the record to
> > > > > > userspace.  Perhaps we collect the ACID information while the event is
> > > > > > occurring, but we defer generating the record until later when we have
> > > > > > a better understanding of what should be included in the ACID list.
> > > > > > It is somewhat similar (but obviously different) to what we do for
> > > > > > PATH records (we collect the pathname info when the path is being
> > > > > > resolved).
> > > > >
> > > > > Ok, now I understand your concern.
> > > > >
> > > > > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the
> > > > > only other possible record and they are generated at the same time with
> > > > > a local context.
> > > > >
> > > > > In the case of any event involving a syscall, that CONTAINER_ID record
> > > > > is generated at the time of the rest of the event record generation at
> > > > > syscall exit.
> > > > >
> > > > > The others are only generated when needed, such as the sig2 reply.
> > > > >
> > > > > We generally just store the contobj pointer until we actually generate
> > > > > the CONTAINER_ID (or CONTAINER_OP) record.
> > > >
> > > > Perhaps I'm remembering your latest spin of these patches incorrectly,
> > > > but there is still a big gap between when the record is generated and
> > > > when it is sent up to the audit daemon.  Most importantly in that gap
> > > > is the whole big queue/multicast/unicast mess.
> > >
> > > So you suggest generating that record on the fly once it reaches the end
> > > of the audit_queue just before being sent?  That sounds...  disruptive.
> > > Each audit daemon is going to have its own queues, so by the time it
> > > ends up in a particular queue, we'll already know its scope and would
> > > have the right list of contids to print in that record.
> >
> > I'm not suggesting any particular solution, I'm just pointing out a
> > potential problem.  It isn't clear to me that you've thought about how
> > we generate a multiple records, each with the correct ACID list
> > intended for a specific audit daemon, based on a single audit event.
> > Explain to me how you intend that to work and we are good.  Be
> > specific because I'm not convinced we are talking on the same plane
> > here.
>
> Well, every time a record gets generated, *any* record gets generated,
> we'll need to check for which audit daemons this record is in scope and
> generate a different one for each depending on the content and whether
> or not the content is influenced by the scope.

That's the problem right there - we don't want to have to generate a
unique record for *each* auditd on *every* record.  That is a recipe
for disaster.

Solving this for all of the known audit records is not something we
need to worry about in depth at the moment (although giving it some
casual thought is not a bad thing), but solving this for the audit
container ID information *is* something we need to worry about right
now.
Richard Guy Briggs March 30, 2020, 5:49 p.m. UTC | #40
On 2020-03-30 13:34, Paul Moore wrote:
> On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2020-03-30 10:26, Paul Moore wrote:
> > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2020-03-28 23:11, Paul Moore wrote:
> > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > On 2020-03-23 20:16, Paul Moore wrote:
> > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > > > > I hope we can do better than string manipulations in the kernel.  I'd
> > > > > > > > > much rather defer generating the ACID list (if possible), than
> > > > > > > > > generating a list only to keep copying and editing it as the record is
> > > > > > > > > sent.
> > > > > > > >
> > > > > > > > At the moment we are stuck with a string-only format.
> > > > > > >
> > > > > > > Yes, we are.  That is another topic, and another set of changes I've
> > > > > > > been deferring so as to not disrupt the audit container ID work.
> > > > > > >
> > > > > > > I was thinking of what we do inside the kernel between when the record
> > > > > > > triggering event happens and when we actually emit the record to
> > > > > > > userspace.  Perhaps we collect the ACID information while the event is
> > > > > > > occurring, but we defer generating the record until later when we have
> > > > > > > a better understanding of what should be included in the ACID list.
> > > > > > > It is somewhat similar (but obviously different) to what we do for
> > > > > > > PATH records (we collect the pathname info when the path is being
> > > > > > > resolved).
> > > > > >
> > > > > > Ok, now I understand your concern.
> > > > > >
> > > > > > In the case of NETFILTER_PKT records, the CONTAINER_ID record is the
> > > > > > only other possible record and they are generated at the same time with
> > > > > > a local context.
> > > > > >
> > > > > > In the case of any event involving a syscall, that CONTAINER_ID record
> > > > > > is generated at the time of the rest of the event record generation at
> > > > > > syscall exit.
> > > > > >
> > > > > > The others are only generated when needed, such as the sig2 reply.
> > > > > >
> > > > > > We generally just store the contobj pointer until we actually generate
> > > > > > the CONTAINER_ID (or CONTAINER_OP) record.
> > > > >
> > > > > Perhaps I'm remembering your latest spin of these patches incorrectly,
> > > > > but there is still a big gap between when the record is generated and
> > > > > when it is sent up to the audit daemon.  Most importantly in that gap
> > > > > is the whole big queue/multicast/unicast mess.
> > > >
> > > > So you suggest generating that record on the fly once it reaches the end
> > > > of the audit_queue just before being sent?  That sounds...  disruptive.
> > > > Each audit daemon is going to have its own queues, so by the time it
> > > > ends up in a particular queue, we'll already know its scope and would
> > > > have the right list of contids to print in that record.
> > >
> > > I'm not suggesting any particular solution, I'm just pointing out a
> > > potential problem.  It isn't clear to me that you've thought about how
> > > we generate a multiple records, each with the correct ACID list
> > > intended for a specific audit daemon, based on a single audit event.
> > > Explain to me how you intend that to work and we are good.  Be
> > > specific because I'm not convinced we are talking on the same plane
> > > here.
> >
> > Well, every time a record gets generated, *any* record gets generated,
> > we'll need to check for which audit daemons this record is in scope and
> > generate a different one for each depending on the content and whether
> > or not the content is influenced by the scope.
> 
> That's the problem right there - we don't want to have to generate a
> unique record for *each* auditd on *every* record.  That is a recipe
> for disaster.

I don't see how we can get around this.

We will already have that problem for PIDs in different PID namespaces.

We already need to use a different serial number in each auditd/queue,
or else we serialize *all* audit events on the machine and either leak
information to the nested daemons that there are other events happenning
on the machine, or confuse the host daemon because it now thinks that we
are losing events due to serial numbers missing because some nested
daemon issued an event that was not relevant to the host daemon,
consuming a globally serial audit message sequence number.

> Solving this for all of the known audit records is not something we
> need to worry about in depth at the moment (although giving it some
> casual thought is not a bad thing), but solving this for the audit
> container ID information *is* something we need to worry about right
> now.

If you think that a different nested contid value string per daemon is
not acceptable, then we are back to issuing a record that has only *one*
contid listed without any nesting information.  This brings us back to
the original problem of keeping *all* audit log history since the boot
of the machine to be able to track the nesting of any particular contid.

What am I missing?  What do you suggest?

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore March 30, 2020, 7:55 p.m. UTC | #41
On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-03-30 13:34, Paul Moore wrote:
> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2020-03-30 10:26, Paul Moore wrote:
> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > On 2020-03-28 23:11, Paul Moore wrote:
> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:

...

> > > Well, every time a record gets generated, *any* record gets generated,
> > > we'll need to check for which audit daemons this record is in scope and
> > > generate a different one for each depending on the content and whether
> > > or not the content is influenced by the scope.
> >
> > That's the problem right there - we don't want to have to generate a
> > unique record for *each* auditd on *every* record.  That is a recipe
> > for disaster.
>
> I don't see how we can get around this.
>
> We will already have that problem for PIDs in different PID namespaces.

As I said below, let's not worry about this for all of the
known/current audit records, lets just think about how we solve this
for the ACID related information.

One of the bigger problems with translating namespace info (e.g. PIDs)
across ACIDs is that an ACID - by definition - has no understanding of
namespaces (both the concept as well as any given instance).

> We already need to use a different serial number in each auditd/queue,
> or else we serialize *all* audit events on the machine and either leak
> information to the nested daemons that there are other events happenning
> on the machine, or confuse the host daemon because it now thinks that we
> are losing events due to serial numbers missing because some nested
> daemon issued an event that was not relevant to the host daemon,
> consuming a globally serial audit message sequence number.

This isn't really relevant to the ACID lists, but sure.

> > Solving this for all of the known audit records is not something we
> > need to worry about in depth at the moment (although giving it some
> > casual thought is not a bad thing), but solving this for the audit
> > container ID information *is* something we need to worry about right
> > now.
>
> If you think that a different nested contid value string per daemon is
> not acceptable, then we are back to issuing a record that has only *one*
> contid listed without any nesting information.  This brings us back to
> the original problem of keeping *all* audit log history since the boot
> of the machine to be able to track the nesting of any particular contid.

I'm not ruling anything out, except for the "let's just completely
regenerate every record for each auditd instance".

> What am I missing?  What do you suggest?

I'm missing a solution in this thread, since you are the person
driving this effort I'm asking you to get creative and present us with
some solutions. :)
Eric W. Biederman April 16, 2020, 8:33 p.m. UTC | #42
Paul Moore <paul@paul-moore.com> writes:

> On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>> On 2020-03-30 13:34, Paul Moore wrote:
>> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>> > > On 2020-03-30 10:26, Paul Moore wrote:
>> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote:
>> > > > > On 2020-03-28 23:11, Paul Moore wrote:
>> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
>> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
>
> ...
>
>> > > Well, every time a record gets generated, *any* record gets generated,
>> > > we'll need to check for which audit daemons this record is in scope and
>> > > generate a different one for each depending on the content and whether
>> > > or not the content is influenced by the scope.
>> >
>> > That's the problem right there - we don't want to have to generate a
>> > unique record for *each* auditd on *every* record.  That is a recipe
>> > for disaster.
>> >
>> > Solving this for all of the known audit records is not something we
>> > need to worry about in depth at the moment (although giving it some
>> > casual thought is not a bad thing), but solving this for the audit
>> > container ID information *is* something we need to worry about right
>> > now.
>>
>> If you think that a different nested contid value string per daemon is
>> not acceptable, then we are back to issuing a record that has only *one*
>> contid listed without any nesting information.  This brings us back to
>> the original problem of keeping *all* audit log history since the boot
>> of the machine to be able to track the nesting of any particular contid.
>
> I'm not ruling anything out, except for the "let's just completely
> regenerate every record for each auditd instance".

Paul I am a bit confused about what you are referring to when you say
regenerate every record.

Are you saying that you don't want to repeat the sequence:
	audit_log_start(...);
	audit_log_format(...);
	audit_log_end(...);
for every nested audit daemon?

Or are you saying that you would like to literraly want to send the same
skb to each of the nested audit daemons?

Or are you thinking of something else?

Eric
Paul Moore April 16, 2020, 9:53 p.m. UTC | #43
On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> Paul Moore <paul@paul-moore.com> writes:
> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> On 2020-03-30 13:34, Paul Moore wrote:
> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> > > On 2020-03-30 10:26, Paul Moore wrote:
> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> > > > > On 2020-03-28 23:11, Paul Moore wrote:
> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> >
> > ...
> >
> >> > > Well, every time a record gets generated, *any* record gets generated,
> >> > > we'll need to check for which audit daemons this record is in scope and
> >> > > generate a different one for each depending on the content and whether
> >> > > or not the content is influenced by the scope.
> >> >
> >> > That's the problem right there - we don't want to have to generate a
> >> > unique record for *each* auditd on *every* record.  That is a recipe
> >> > for disaster.
> >> >
> >> > Solving this for all of the known audit records is not something we
> >> > need to worry about in depth at the moment (although giving it some
> >> > casual thought is not a bad thing), but solving this for the audit
> >> > container ID information *is* something we need to worry about right
> >> > now.
> >>
> >> If you think that a different nested contid value string per daemon is
> >> not acceptable, then we are back to issuing a record that has only *one*
> >> contid listed without any nesting information.  This brings us back to
> >> the original problem of keeping *all* audit log history since the boot
> >> of the machine to be able to track the nesting of any particular contid.
> >
> > I'm not ruling anything out, except for the "let's just completely
> > regenerate every record for each auditd instance".
>
> Paul I am a bit confused about what you are referring to when you say
> regenerate every record.
>
> Are you saying that you don't want to repeat the sequence:
>         audit_log_start(...);
>         audit_log_format(...);
>         audit_log_end(...);
> for every nested audit daemon?

If it can be avoided yes.  Audit performance is already not-awesome,
this would make it even worse.

> Or are you saying that you would like to literraly want to send the same
> skb to each of the nested audit daemons?

Ideally we would reuse the generated audit messages as much as
possible.  Less work is better.  That's really my main concern here,
let's make sure we aren't going to totally tank performance when we
have a bunch of nested audit daemons.

> Or are you thinking of something else?

As mentioned above, I'm not thinking of anything specific, other than
let's please not have to regenerate *all* of the audit record strings
for each instance of an audit daemon, that's going to be a killer.

Maybe we have to regenerate some, if we do, what would that look like
in code?  How do we handle the regeneration aspect?  I worry that is
going to be really ugly.

Maybe we finally burn down the audit_log_format(...) function and pass
structs/TLVs to the audit subsystem and the audit subsystem generates
the strings in the auditd connection thread.  Some of the record
strings could likely be shared, others would need to be ACID/auditd
dependent.

I'm open to any ideas people may have.  We have a problem, let's solve it.
Eric W. Biederman April 17, 2020, 10:23 p.m. UTC | #44
Paul Moore <paul@paul-moore.com> writes:

> On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> Paul Moore <paul@paul-moore.com> writes:
>> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>> >> On 2020-03-30 13:34, Paul Moore wrote:
>> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>> >> > > On 2020-03-30 10:26, Paul Moore wrote:
>> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote:
>> >> > > > > On 2020-03-28 23:11, Paul Moore wrote:
>> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
>> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
>> >
>> > ...
>> >
>> >> > > Well, every time a record gets generated, *any* record gets generated,
>> >> > > we'll need to check for which audit daemons this record is in scope and
>> >> > > generate a different one for each depending on the content and whether
>> >> > > or not the content is influenced by the scope.
>> >> >
>> >> > That's the problem right there - we don't want to have to generate a
>> >> > unique record for *each* auditd on *every* record.  That is a recipe
>> >> > for disaster.
>> >> >
>> >> > Solving this for all of the known audit records is not something we
>> >> > need to worry about in depth at the moment (although giving it some
>> >> > casual thought is not a bad thing), but solving this for the audit
>> >> > container ID information *is* something we need to worry about right
>> >> > now.
>> >>
>> >> If you think that a different nested contid value string per daemon is
>> >> not acceptable, then we are back to issuing a record that has only *one*
>> >> contid listed without any nesting information.  This brings us back to
>> >> the original problem of keeping *all* audit log history since the boot
>> >> of the machine to be able to track the nesting of any particular contid.
>> >
>> > I'm not ruling anything out, except for the "let's just completely
>> > regenerate every record for each auditd instance".
>>
>> Paul I am a bit confused about what you are referring to when you say
>> regenerate every record.
>>
>> Are you saying that you don't want to repeat the sequence:
>>         audit_log_start(...);
>>         audit_log_format(...);
>>         audit_log_end(...);
>> for every nested audit daemon?
>
> If it can be avoided yes.  Audit performance is already not-awesome,
> this would make it even worse.

As far as I can see not repeating sequences like that is fundamental
for making this work at all.  Just because only the audit subsystem
should know about one or multiple audit daemons.  Nothing else should
care.

>> Or are you saying that you would like to literraly want to send the same
>> skb to each of the nested audit daemons?
>
> Ideally we would reuse the generated audit messages as much as
> possible.  Less work is better.  That's really my main concern here,
> let's make sure we aren't going to totally tank performance when we
> have a bunch of nested audit daemons.

So I think there are two parts of this answer.  Assuming we are talking
about nesting audit daemons in containers we will have different
rulesets and I expect most of the events for a nested audit daemon won't
be of interest to the outer audit daemon.

Beyond that it should be very straight forward to keep a pointer and
leave the buffer as a scatter gather list until audit_log_end
and translate pids, and rewrite ACIDs attributes in audit_log_end
when we build the final packet.  Either through collaboration with
audit_log_format or a special audit_log command that carefully sets
up the handful of things that need that information.

Hmm.  I am seeing that we send skbs to kauditd and then kauditd
sends those skbs to userspace.  I presume that is primary so that
sending messages to userspace does not block the process being audited.

Plus a little bit so that the retry logic will work.

I think the naive implementation would be to simply have 1 kauditd
per auditd (strictly and audit context/namespace).  Although that can be
optimized if that is a problem.

Beyond that I think we would need to look at profiles to really
understand where the bottlenecks are.

>> Or are you thinking of something else?
>
> As mentioned above, I'm not thinking of anything specific, other than
> let's please not have to regenerate *all* of the audit record strings
> for each instance of an audit daemon, that's going to be a killer.
>
> Maybe we have to regenerate some, if we do, what would that look like
> in code?  How do we handle the regeneration aspect?  I worry that is
> going to be really ugly.
>
> Maybe we finally burn down the audit_log_format(...) function and pass
> structs/TLVs to the audit subsystem and the audit subsystem generates
> the strings in the auditd connection thread.  Some of the record
> strings could likely be shared, others would need to be ACID/auditd
> dependent.

I think we just a very limited amount of structs/TLVs for the cases that
matter and one-one auditd and kauditd implementations we should still
be able to do everything in audit_log_end.  Plus doing as much work as
possible in audit_log_end where things are still cache hot is desirable.

> I'm open to any ideas people may have.  We have a problem, let's solve
> it.

It definitely makes sense to look ahead to having audit daemons running
in containers, but in the grand scheme of things that is a nice to have.
Probably something we will and should get to, but we have lived a long
time without auditd running in containers so I expect we can live a
while longer.

As I understand Richard patchset for the specific case of the ACID we
are only talking about taking a subset of an existing string, and one
string at that.  Not hard at all.  Especially when looking at the
fundamental fact that we will need to send a different skb to
userspace, for each audit daemon.

Eric
Paul Moore April 22, 2020, 5:24 p.m. UTC | #45
On Fri, Apr 17, 2020 at 6:26 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> Paul Moore <paul@paul-moore.com> writes:
> > On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> Paul Moore <paul@paul-moore.com> writes:
> >> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> >> On 2020-03-30 13:34, Paul Moore wrote:
> >> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> >> > > On 2020-03-30 10:26, Paul Moore wrote:
> >> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> >> > > > > On 2020-03-28 23:11, Paul Moore wrote:
> >> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
> >> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> >> >
> >> > ...
> >> >
> >> >> > > Well, every time a record gets generated, *any* record gets generated,
> >> >> > > we'll need to check for which audit daemons this record is in scope and
> >> >> > > generate a different one for each depending on the content and whether
> >> >> > > or not the content is influenced by the scope.
> >> >> >
> >> >> > That's the problem right there - we don't want to have to generate a
> >> >> > unique record for *each* auditd on *every* record.  That is a recipe
> >> >> > for disaster.
> >> >> >
> >> >> > Solving this for all of the known audit records is not something we
> >> >> > need to worry about in depth at the moment (although giving it some
> >> >> > casual thought is not a bad thing), but solving this for the audit
> >> >> > container ID information *is* something we need to worry about right
> >> >> > now.
> >> >>
> >> >> If you think that a different nested contid value string per daemon is
> >> >> not acceptable, then we are back to issuing a record that has only *one*
> >> >> contid listed without any nesting information.  This brings us back to
> >> >> the original problem of keeping *all* audit log history since the boot
> >> >> of the machine to be able to track the nesting of any particular contid.
> >> >
> >> > I'm not ruling anything out, except for the "let's just completely
> >> > regenerate every record for each auditd instance".
> >>
> >> Paul I am a bit confused about what you are referring to when you say
> >> regenerate every record.
> >>
> >> Are you saying that you don't want to repeat the sequence:
> >>         audit_log_start(...);
> >>         audit_log_format(...);
> >>         audit_log_end(...);
> >> for every nested audit daemon?
> >
> > If it can be avoided yes.  Audit performance is already not-awesome,
> > this would make it even worse.
>
> As far as I can see not repeating sequences like that is fundamental
> for making this work at all.  Just because only the audit subsystem
> should know about one or multiple audit daemons.  Nothing else should
> care.

Yes, exactly, this has been mentioned in the past.  Both the
performance hit and the code complication in the caller are things we
must avoid.

> >> Or are you saying that you would like to literraly want to send the same
> >> skb to each of the nested audit daemons?
> >
> > Ideally we would reuse the generated audit messages as much as
> > possible.  Less work is better.  That's really my main concern here,
> > let's make sure we aren't going to totally tank performance when we
> > have a bunch of nested audit daemons.
>
> So I think there are two parts of this answer.  Assuming we are talking
> about nesting audit daemons in containers we will have different
> rulesets and I expect most of the events for a nested audit daemon won't
> be of interest to the outer audit daemon.

Yes, this is another thing that Richard and I have discussed in the
past.  We will basically need to create per-daemon queues, rules,
tracking state, etc.; that is easy enough.  What will be slightly more
tricky is the part where we apply the filters to the individual
records and decide if that record is valid/desired for a given daemon.
I think it can be done without too much pain, and any changes to the
callers, but it will require a bit of work to make sure it is done
well and that records are needlessly duplicated in the kernel.

> Beyond that it should be very straight forward to keep a pointer and
> leave the buffer as a scatter gather list until audit_log_end
> and translate pids, and rewrite ACIDs attributes in audit_log_end
> when we build the final packet.  Either through collaboration with
> audit_log_format or a special audit_log command that carefully sets
> up the handful of things that need that information.

In order to maximize record re-use I think we will want to hold off on
assembling the final packet until it is sent to the daemons in the
kauditd thread.  We'll also likely need to create special
audit_log_XXX functions to capture fields which we know will need
translation, e.g. ACID information.  (the reason for the new
audit_log_XXX functions would be to mark the new sg element and ensure
the buffer is handled correctly)

Regardless of the details, I think the scatter gather approach is the
key here - that seems like the best design idea I've seen thus far.
It enables us to replace portions of the record as needed ... and
possibly use the existing skb cow stuff ... it has been a while, but
does the skb cow functions handle scatter gather skbs or do they need
to be linear?

> Hmm.  I am seeing that we send skbs to kauditd and then kauditd
> sends those skbs to userspace.  I presume that is primary so that
> sending messages to userspace does not block the process being audited.
> Plus a little bit so that the retry logic will work.

Long story short, it's a poor design.  I'm not sure who came up with
it, but I have about a 1000 questions that are variations on "why did
this seem like a good idea?".

I expect the audit_buffer definition to change significantly during
the nested auditd work.

> I think the naive implementation would be to simply have 1 kauditd
> per auditd (strictly and audit context/namespace).  Although that can be
> optimized if that is a problem.
>
> Beyond that I think we would need to look at profiles to really
> understand where the bottlenecks are.

Agreed.  This is a hidden implementation detail that doesn't affect
the userspace API or the in-kernel callers.  The first approach can be
simple and we can complicate it as needed in future versions.

> > I'm open to any ideas people may have.  We have a problem, let's solve
> > it.
>
> It definitely makes sense to look ahead to having audit daemons running
> in containers, but in the grand scheme of things that is a nice to have.
> Probably something we will and should get to, but we have lived a long
> time without auditd running in containers so I expect we can live a
> while longer.

It looks like you are confusing my concern.  I'm not pushing Richard
to implement support for this in the current patchset, I'm pushing
Richard to consider the design aspect of having multiple audit daemons
so that we don't code ourselves into a corner with the audit record
changes he is proposing.  The audit record format is part of the
kernel/userspace API and as a result requires great care when
modifying/extending/etc.
Richard Guy Briggs June 8, 2020, 6:03 p.m. UTC | #46
On 2020-04-22 13:24, Paul Moore wrote:
> On Fri, Apr 17, 2020 at 6:26 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > Paul Moore <paul@paul-moore.com> writes:
> > > On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > >> Paul Moore <paul@paul-moore.com> writes:
> > >> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > >> >> On 2020-03-30 13:34, Paul Moore wrote:
> > >> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > >> >> > > On 2020-03-30 10:26, Paul Moore wrote:
> > >> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > >> >> > > > > On 2020-03-28 23:11, Paul Moore wrote:
> > >> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > >> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
> > >> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > >> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> > >> >
> > >> > ...
> > >> >
> > >> >> > > Well, every time a record gets generated, *any* record gets generated,
> > >> >> > > we'll need to check for which audit daemons this record is in scope and
> > >> >> > > generate a different one for each depending on the content and whether
> > >> >> > > or not the content is influenced by the scope.
> > >> >> >
> > >> >> > That's the problem right there - we don't want to have to generate a
> > >> >> > unique record for *each* auditd on *every* record.  That is a recipe
> > >> >> > for disaster.
> > >> >> >
> > >> >> > Solving this for all of the known audit records is not something we
> > >> >> > need to worry about in depth at the moment (although giving it some
> > >> >> > casual thought is not a bad thing), but solving this for the audit
> > >> >> > container ID information *is* something we need to worry about right
> > >> >> > now.
> > >> >>
> > >> >> If you think that a different nested contid value string per daemon is
> > >> >> not acceptable, then we are back to issuing a record that has only *one*
> > >> >> contid listed without any nesting information.  This brings us back to
> > >> >> the original problem of keeping *all* audit log history since the boot
> > >> >> of the machine to be able to track the nesting of any particular contid.
> > >> >
> > >> > I'm not ruling anything out, except for the "let's just completely
> > >> > regenerate every record for each auditd instance".
> > >>
> > >> Paul I am a bit confused about what you are referring to when you say
> > >> regenerate every record.
> > >>
> > >> Are you saying that you don't want to repeat the sequence:
> > >>         audit_log_start(...);
> > >>         audit_log_format(...);
> > >>         audit_log_end(...);
> > >> for every nested audit daemon?
> > >
> > > If it can be avoided yes.  Audit performance is already not-awesome,
> > > this would make it even worse.
> >
> > As far as I can see not repeating sequences like that is fundamental
> > for making this work at all.  Just because only the audit subsystem
> > should know about one or multiple audit daemons.  Nothing else should
> > care.
> 
> Yes, exactly, this has been mentioned in the past.  Both the
> performance hit and the code complication in the caller are things we
> must avoid.
> 
> > >> Or are you saying that you would like to literraly want to send the same
> > >> skb to each of the nested audit daemons?
> > >
> > > Ideally we would reuse the generated audit messages as much as
> > > possible.  Less work is better.  That's really my main concern here,
> > > let's make sure we aren't going to totally tank performance when we
> > > have a bunch of nested audit daemons.
> >
> > So I think there are two parts of this answer.  Assuming we are talking
> > about nesting audit daemons in containers we will have different
> > rulesets and I expect most of the events for a nested audit daemon won't
> > be of interest to the outer audit daemon.
> 
> Yes, this is another thing that Richard and I have discussed in the
> past.  We will basically need to create per-daemon queues, rules,
> tracking state, etc.; that is easy enough.  What will be slightly more
> tricky is the part where we apply the filters to the individual
> records and decide if that record is valid/desired for a given daemon.
> I think it can be done without too much pain, and any changes to the
> callers, but it will require a bit of work to make sure it is done
> well and that records are needlessly duplicated in the kernel.
> 
> > Beyond that it should be very straight forward to keep a pointer and
> > leave the buffer as a scatter gather list until audit_log_end
> > and translate pids, and rewrite ACIDs attributes in audit_log_end
> > when we build the final packet.  Either through collaboration with
> > audit_log_format or a special audit_log command that carefully sets
> > up the handful of things that need that information.
> 
> In order to maximize record re-use I think we will want to hold off on
> assembling the final packet until it is sent to the daemons in the
> kauditd thread.  We'll also likely need to create special
> audit_log_XXX functions to capture fields which we know will need
> translation, e.g. ACID information.  (the reason for the new
> audit_log_XXX functions would be to mark the new sg element and ensure
> the buffer is handled correctly)
> 
> Regardless of the details, I think the scatter gather approach is the
> key here - that seems like the best design idea I've seen thus far.
> It enables us to replace portions of the record as needed ... and
> possibly use the existing skb cow stuff ... it has been a while, but
> does the skb cow functions handle scatter gather skbs or do they need
> to be linear?

How does the selection of this data management technique affect our
choice of field format?  Does this lock the field value to a fixed
length?  Does the use of scatter/gather techniques or structures allow
the use of different lengths of data for each destination (auditd)?  I
could see different target audit daemons triggering or switching to a
different chunk of data and length.  This does raise a concern related
to the previous sig_info2 discussion that the struct contobj that exists
at the time of audit_log_exit called could have been reaped by the time
the buffer is pulled from the queue for transmission to auditd, but we
could hold a reference to it as is done for sig_info2.

Looking through the kernel scatter/gather possibilities, I see struct
iovec which is used by the readv/writev/preadv/pwritev syscalls, but I'm
understanding that this is a kernel implementation that will be not
visible to user space.  So would the struct scatterlist be the right
choice?

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore June 17, 2020, 9:33 p.m. UTC | #47
On Mon, Jun 8, 2020 at 2:04 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-04-22 13:24, Paul Moore wrote:
> > On Fri, Apr 17, 2020 at 6:26 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > Paul Moore <paul@paul-moore.com> writes:
> > > > On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > >> Paul Moore <paul@paul-moore.com> writes:
> > > >> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > >> >> On 2020-03-30 13:34, Paul Moore wrote:
> > > >> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > >> >> > > On 2020-03-30 10:26, Paul Moore wrote:
> > > >> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > >> >> > > > > On 2020-03-28 23:11, Paul Moore wrote:
> > > >> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > >> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
> > > >> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > >> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> > > >> >
> > > >> > ...
> > > >> >
> > > >> >> > > Well, every time a record gets generated, *any* record gets generated,
> > > >> >> > > we'll need to check for which audit daemons this record is in scope and
> > > >> >> > > generate a different one for each depending on the content and whether
> > > >> >> > > or not the content is influenced by the scope.
> > > >> >> >
> > > >> >> > That's the problem right there - we don't want to have to generate a
> > > >> >> > unique record for *each* auditd on *every* record.  That is a recipe
> > > >> >> > for disaster.
> > > >> >> >
> > > >> >> > Solving this for all of the known audit records is not something we
> > > >> >> > need to worry about in depth at the moment (although giving it some
> > > >> >> > casual thought is not a bad thing), but solving this for the audit
> > > >> >> > container ID information *is* something we need to worry about right
> > > >> >> > now.
> > > >> >>
> > > >> >> If you think that a different nested contid value string per daemon is
> > > >> >> not acceptable, then we are back to issuing a record that has only *one*
> > > >> >> contid listed without any nesting information.  This brings us back to
> > > >> >> the original problem of keeping *all* audit log history since the boot
> > > >> >> of the machine to be able to track the nesting of any particular contid.
> > > >> >
> > > >> > I'm not ruling anything out, except for the "let's just completely
> > > >> > regenerate every record for each auditd instance".
> > > >>
> > > >> Paul I am a bit confused about what you are referring to when you say
> > > >> regenerate every record.
> > > >>
> > > >> Are you saying that you don't want to repeat the sequence:
> > > >>         audit_log_start(...);
> > > >>         audit_log_format(...);
> > > >>         audit_log_end(...);
> > > >> for every nested audit daemon?
> > > >
> > > > If it can be avoided yes.  Audit performance is already not-awesome,
> > > > this would make it even worse.
> > >
> > > As far as I can see not repeating sequences like that is fundamental
> > > for making this work at all.  Just because only the audit subsystem
> > > should know about one or multiple audit daemons.  Nothing else should
> > > care.
> >
> > Yes, exactly, this has been mentioned in the past.  Both the
> > performance hit and the code complication in the caller are things we
> > must avoid.
> >
> > > >> Or are you saying that you would like to literraly want to send the same
> > > >> skb to each of the nested audit daemons?
> > > >
> > > > Ideally we would reuse the generated audit messages as much as
> > > > possible.  Less work is better.  That's really my main concern here,
> > > > let's make sure we aren't going to totally tank performance when we
> > > > have a bunch of nested audit daemons.
> > >
> > > So I think there are two parts of this answer.  Assuming we are talking
> > > about nesting audit daemons in containers we will have different
> > > rulesets and I expect most of the events for a nested audit daemon won't
> > > be of interest to the outer audit daemon.
> >
> > Yes, this is another thing that Richard and I have discussed in the
> > past.  We will basically need to create per-daemon queues, rules,
> > tracking state, etc.; that is easy enough.  What will be slightly more
> > tricky is the part where we apply the filters to the individual
> > records and decide if that record is valid/desired for a given daemon.
> > I think it can be done without too much pain, and any changes to the
> > callers, but it will require a bit of work to make sure it is done
> > well and that records are needlessly duplicated in the kernel.
> >
> > > Beyond that it should be very straight forward to keep a pointer and
> > > leave the buffer as a scatter gather list until audit_log_end
> > > and translate pids, and rewrite ACIDs attributes in audit_log_end
> > > when we build the final packet.  Either through collaboration with
> > > audit_log_format or a special audit_log command that carefully sets
> > > up the handful of things that need that information.
> >
> > In order to maximize record re-use I think we will want to hold off on
> > assembling the final packet until it is sent to the daemons in the
> > kauditd thread.  We'll also likely need to create special
> > audit_log_XXX functions to capture fields which we know will need
> > translation, e.g. ACID information.  (the reason for the new
> > audit_log_XXX functions would be to mark the new sg element and ensure
> > the buffer is handled correctly)
> >
> > Regardless of the details, I think the scatter gather approach is the
> > key here - that seems like the best design idea I've seen thus far.
> > It enables us to replace portions of the record as needed ... and
> > possibly use the existing skb cow stuff ... it has been a while, but
> > does the skb cow functions handle scatter gather skbs or do they need
> > to be linear?
>
> How does the selection of this data management technique affect our
> choice of field format?

I'm not sure it affects the record string, but it might affect the
in-kernel API as we would likely want to have a special function for
logging the audit container ID that does the scatter-gather management
for the record.  There might also need to be some changes to how we
allocate the records.

However, since you're the one working on these patches I would expect
you to be the one to look into how this would work and what the
impacts might be to the code, record format, etc.

> Does this lock the field value to a fixed length?

I wouldn't think so.  In fact if it did it wouldn't really be a good solution.

Once again, this is something I would expect you to look into.

> Does the use of scatter/gather techniques or structures allow
> the use of different lengths of data for each destination (auditd)?

This is related to the above ... but yes, the reason why Eric and I
were discussing a scatter/gather approach is that it would presumably
allow one to break the single record string into pieces which could be
managed and manipulated much easier than the monolithic record string.

> I could see different target audit daemons triggering or switching to a
> different chunk of data and length.  This does raise a concern related
> to the previous sig_info2 discussion that the struct contobj that exists
> at the time of audit_log_exit called could have been reaped by the time
> the buffer is pulled from the queue for transmission to auditd, but we
> could hold a reference to it as is done for sig_info2.

Yes.

> Looking through the kernel scatter/gather possibilities, I see struct
> iovec which is used by the readv/writev/preadv/pwritev syscalls, but I'm
> understanding that this is a kernel implementation that will be not
> visible to user space.  So would the struct scatterlist be the right
> choice?

It has been so long since I've looked at the scatter-gather code that
I can't really say with any confidence at this point.  All I can say
is that the scatter-gather code really should just be an
implementation detail in the kernel and should not be visible to
userspace; userspace should get the same awful, improperly generated
netlink message it always has received from the kernel ;)
Richard Guy Briggs June 19, 2020, 3:22 p.m. UTC | #48
On 2020-04-17 17:23, Eric W. Biederman wrote:
> Paul Moore <paul@paul-moore.com> writes:
> 
> > On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> Paul Moore <paul@paul-moore.com> writes:
> >> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> >> On 2020-03-30 13:34, Paul Moore wrote:
> >> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> >> > > On 2020-03-30 10:26, Paul Moore wrote:
> >> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> >> > > > > On 2020-03-28 23:11, Paul Moore wrote:
> >> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
> >> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
> >> >
> >> > ...
> >> >
> >> >> > > Well, every time a record gets generated, *any* record gets generated,
> >> >> > > we'll need to check for which audit daemons this record is in scope and
> >> >> > > generate a different one for each depending on the content and whether
> >> >> > > or not the content is influenced by the scope.
> >> >> >
> >> >> > That's the problem right there - we don't want to have to generate a
> >> >> > unique record for *each* auditd on *every* record.  That is a recipe
> >> >> > for disaster.
> >> >> >
> >> >> > Solving this for all of the known audit records is not something we
> >> >> > need to worry about in depth at the moment (although giving it some
> >> >> > casual thought is not a bad thing), but solving this for the audit
> >> >> > container ID information *is* something we need to worry about right
> >> >> > now.
> >> >>
> >> >> If you think that a different nested contid value string per daemon is
> >> >> not acceptable, then we are back to issuing a record that has only *one*
> >> >> contid listed without any nesting information.  This brings us back to
> >> >> the original problem of keeping *all* audit log history since the boot
> >> >> of the machine to be able to track the nesting of any particular contid.
> >> >
> >> > I'm not ruling anything out, except for the "let's just completely
> >> > regenerate every record for each auditd instance".
> >>
> >> Paul I am a bit confused about what you are referring to when you say
> >> regenerate every record.
> >>
> >> Are you saying that you don't want to repeat the sequence:
> >>         audit_log_start(...);
> >>         audit_log_format(...);
> >>         audit_log_end(...);
> >> for every nested audit daemon?
> >
> > If it can be avoided yes.  Audit performance is already not-awesome,
> > this would make it even worse.
> 
> As far as I can see not repeating sequences like that is fundamental
> for making this work at all.  Just because only the audit subsystem
> should know about one or multiple audit daemons.  Nothing else should
> care.
> 
> >> Or are you saying that you would like to literraly want to send the same
> >> skb to each of the nested audit daemons?
> >
> > Ideally we would reuse the generated audit messages as much as
> > possible.  Less work is better.  That's really my main concern here,
> > let's make sure we aren't going to totally tank performance when we
> > have a bunch of nested audit daemons.
> 
> So I think there are two parts of this answer.  Assuming we are talking
> about nesting audit daemons in containers we will have different
> rulesets and I expect most of the events for a nested audit daemon won't
> be of interest to the outer audit daemon.
> 
> Beyond that it should be very straight forward to keep a pointer and
> leave the buffer as a scatter gather list until audit_log_end
> and translate pids, and rewrite ACIDs attributes in audit_log_end
> when we build the final packet.  Either through collaboration with
> audit_log_format or a special audit_log command that carefully sets
> up the handful of things that need that information.
> 
> Hmm.  I am seeing that we send skbs to kauditd and then kauditd
> sends those skbs to userspace.  I presume that is primary so that
> sending messages to userspace does not block the process being audited.
> 
> Plus a little bit so that the retry logic will work.
> 
> I think the naive implementation would be to simply have 1 kauditd
> per auditd (strictly and audit context/namespace).  Although that can be
> optimized if that is a problem.
> 
> Beyond that I think we would need to look at profiles to really
> understand where the bottlenecks are.
> 
> >> Or are you thinking of something else?
> >
> > As mentioned above, I'm not thinking of anything specific, other than
> > let's please not have to regenerate *all* of the audit record strings
> > for each instance of an audit daemon, that's going to be a killer.
> >
> > Maybe we have to regenerate some, if we do, what would that look like
> > in code?  How do we handle the regeneration aspect?  I worry that is
> > going to be really ugly.
> >
> > Maybe we finally burn down the audit_log_format(...) function and pass
> > structs/TLVs to the audit subsystem and the audit subsystem generates
> > the strings in the auditd connection thread.  Some of the record
> > strings could likely be shared, others would need to be ACID/auditd
> > dependent.
> 
> I think we just a very limited amount of structs/TLVs for the cases that
> matter and one-one auditd and kauditd implementations we should still
> be able to do everything in audit_log_end.  Plus doing as much work as
> possible in audit_log_end where things are still cache hot is desirable.

So in the end, perf may show us that moving things around a bit and
knowing to which queue(s) we send an skb will help maintain performance
by writing out the field contents in audit_log_end() and sending to the
correct queue rather than deferring writing out that field contents in
the kauditd process due to cache issues.  In any case, it makes sense to
delay that formatting work until just after the daemon routing decision
is made.

> > I'm open to any ideas people may have.  We have a problem, let's solve
> > it.
> 
> It definitely makes sense to look ahead to having audit daemons running
> in containers, but in the grand scheme of things that is a nice to have.
> Probably something we will and should get to, but we have lived a long
> time without auditd running in containers so I expect we can live a
> while longer.
> 
> As I understand Richard patchset for the specific case of the ACID we
> are only talking about taking a subset of an existing string, and one
> string at that.  Not hard at all.  Especially when looking at the
> fundamental fact that we will need to send a different skb to
> userspace, for each audit daemon.
> 
> Eric

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
diff mbox series

Patch

diff --git a/include/linux/audit.h b/include/linux/audit.h
index 2636b0ad0011..6929a02080f7 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -22,6 +22,13 @@  struct audit_sig_info {
 	char		ctx[0];
 };
 
+struct audit_sig_info2 {
+	uid_t		uid;
+	pid_t		pid;
+	u64		cid;
+	char		ctx[0];
+};
+
 struct audit_buffer;
 struct audit_context;
 struct inode;
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 93417a8af9d0..4f87b06f0acd 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -72,6 +72,7 @@ 
 #define AUDIT_SET_FEATURE	1018	/* Turn an audit feature on or off */
 #define AUDIT_GET_FEATURE	1019	/* Get which features are enabled */
 #define AUDIT_CONTAINER_OP	1020	/* Define the container id and info */
+#define AUDIT_SIGNAL_INFO2	1021	/* Get info auditd signal sender */
 
 #define AUDIT_FIRST_USER_MSG	1100	/* Userspace messages mostly uninteresting to kernel */
 #define AUDIT_USER_AVC		1107	/* We filter this differently */
diff --git a/kernel/audit.c b/kernel/audit.c
index 0871c3e5d6df..51159c94041c 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -126,6 +126,14 @@  struct auditd_connection {
 kuid_t		audit_sig_uid = INVALID_UID;
 pid_t		audit_sig_pid = -1;
 u32		audit_sig_sid = 0;
+/* Since the signal information is stored in the record buffer at the
+ * time of the signal, but not retrieved until later, there is a chance
+ * that the last process in the container could terminate before the
+ * signal record is delivered.  In this circumstance, there is a chance
+ * the orchestrator could reuse the audit container identifier, causing
+ * an overlap of audit records that refer to the same audit container
+ * identifier, but a different container instance.  */
+u64		audit_sig_cid = AUDIT_CID_UNSET;
 
 /* Records can be lost in several ways:
    0) [suppressed in audit_alloc]
@@ -1123,6 +1131,7 @@  static int audit_netlink_ok(struct sk_buff *skb, u16 msg_type)
 	case AUDIT_ADD_RULE:
 	case AUDIT_DEL_RULE:
 	case AUDIT_SIGNAL_INFO:
+	case AUDIT_SIGNAL_INFO2:
 	case AUDIT_TTY_GET:
 	case AUDIT_TTY_SET:
 	case AUDIT_TRIM:
@@ -1286,6 +1295,7 @@  static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	struct audit_buffer	*ab;
 	u16			msg_type = nlh->nlmsg_type;
 	struct audit_sig_info   *sig_data;
+	struct audit_sig_info2  *sig_data2;
 	char			*ctx = NULL;
 	u32			len;
 
@@ -1545,6 +1555,30 @@  static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 				 sig_data, sizeof(*sig_data) + len);
 		kfree(sig_data);
 		break;
+	case AUDIT_SIGNAL_INFO2:
+		len = 0;
+		if (audit_sig_sid) {
+			err = security_secid_to_secctx(audit_sig_sid, &ctx, &len);
+			if (err)
+				return err;
+		}
+		sig_data2 = kmalloc(sizeof(*sig_data2) + len, GFP_KERNEL);
+		if (!sig_data2) {
+			if (audit_sig_sid)
+				security_release_secctx(ctx, len);
+			return -ENOMEM;
+		}
+		sig_data2->uid = from_kuid(&init_user_ns, audit_sig_uid);
+		sig_data2->pid = audit_sig_pid;
+		if (audit_sig_sid) {
+			memcpy(sig_data2->ctx, ctx, len);
+			security_release_secctx(ctx, len);
+		}
+		sig_data2->cid = audit_sig_cid;
+		audit_send_reply(skb, seq, AUDIT_SIGNAL_INFO2, 0, 0,
+				 sig_data2, sizeof(*sig_data2) + len);
+		kfree(sig_data2);
+		break;
 	case AUDIT_TTY_GET: {
 		struct audit_tty_status s;
 		unsigned int t;
@@ -2414,6 +2448,7 @@  int audit_signal_info(int sig, struct task_struct *t)
 		else
 			audit_sig_uid = uid;
 		security_task_getsecid(current, &audit_sig_sid);
+		audit_sig_cid = audit_get_contid(current);
 	}
 
 	return audit_signal_info_syscall(t);
diff --git a/kernel/audit.h b/kernel/audit.h
index 162de8366b32..de358ac61587 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -346,6 +346,7 @@  static inline int audit_signal_info_syscall(struct task_struct *t)
 extern pid_t audit_sig_pid;
 extern kuid_t audit_sig_uid;
 extern u32 audit_sig_sid;
+extern u64 audit_sig_cid;
 
 extern int audit_filter(int msgtype, unsigned int listtype);
 
diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
index c97fdae8f71b..f006d8b70b65 100644
--- a/security/selinux/nlmsgtab.c
+++ b/security/selinux/nlmsgtab.c
@@ -134,6 +134,7 @@  struct nlmsg_perm {
 	{ AUDIT_DEL_RULE,	NETLINK_AUDIT_SOCKET__NLMSG_WRITE    },
 	{ AUDIT_USER,		NETLINK_AUDIT_SOCKET__NLMSG_RELAY    },
 	{ AUDIT_SIGNAL_INFO,	NETLINK_AUDIT_SOCKET__NLMSG_READ     },
+	{ AUDIT_SIGNAL_INFO2,	NETLINK_AUDIT_SOCKET__NLMSG_READ     },
 	{ AUDIT_TRIM,		NETLINK_AUDIT_SOCKET__NLMSG_WRITE    },
 	{ AUDIT_MAKE_EQUIV,	NETLINK_AUDIT_SOCKET__NLMSG_WRITE    },
 	{ AUDIT_TTY_GET,	NETLINK_AUDIT_SOCKET__NLMSG_READ     },