[RFC,0/2] tracing/user_events: Remote write ABI

Message ID 20221027224011.2075-1-beaub@linux.microsoft.com (mailing list archive)

Message

Beau Belgrave Oct. 27, 2022, 10:40 p.m. UTC
As part of the discussions for user_events aligned with user space
tracers, it was determined that user programs should register a 32-bit
value to set or clear a bit when an event becomes enabled. Currently a
shared page is being used that requires mmap().

In this new model, 2 new values are specified during event registration
from user programs. The first is the address to update when the event
is either enabled or disabled. The second is the bit to set/clear to
reflect the event being enabled. This allows for a local 32-bit value in
user programs to support both kernel and user tracers. As an example,
setting bit 31 for kernel tracers when the event becomes enabled allows
for user tracers to use the other bits for ref counts or other flags.
The kernel side updates the bit atomically; user programs also need to
update these values atomically.
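
A minimal sketch of the shared-word layout described above, assuming C11
atomics. The helper names are hypothetical and not part of the patches, and
kernel_set_enabled() only stands in for the kernel's remote write:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bit 31 belongs to the kernel tracer, bits 0-30 hold a
 * reference count owned by user-space tracers. */
#define KERNEL_ENABLE_BIT  31u
#define KERNEL_ENABLE_MASK (UINT32_C(1) << KERNEL_ENABLE_BIT)

/* Stand-in for the kernel side: atomically set or clear only bit 31. */
static void kernel_set_enabled(_Atomic uint32_t *word, bool on)
{
    if (on)
        atomic_fetch_or(word, KERNEL_ENABLE_MASK);
    else
        atomic_fetch_and(word, ~KERNEL_ENABLE_MASK);
}

/* User tracers take and drop references in the low bits. The count must
 * stay below 2^31 so it never carries into the kernel's bit. */
static void user_tracer_ref(_Atomic uint32_t *word)
{
    atomic_fetch_add(word, 1);
}

static void user_tracer_unref(_Atomic uint32_t *word)
{
    atomic_fetch_sub(word, 1);
}
```

Because every update is an atomic read-modify-write, a kernel bit flip and a
user ref count change on the same word do not overwrite each other.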

User-provided addresses must be aligned on a 32-bit boundary. This
allows for single-page checking and prevents odd behaviors such as a
32-bit value straddling 2 pages instead of a single page.
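
The alignment rule can be illustrated with a small sketch. It assumes a
4096-byte page for the arithmetic (real code would query the page size), and
both function names are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for the example only; use sysconf(_SC_PAGESIZE) in real code. */
#define PAGE_SIZE_ASSUMED 4096u

/* True when addr sits on a 4-byte (32-bit) boundary. */
static bool is_aligned_32(uintptr_t addr)
{
    return (addr & 3u) == 0;
}

/* True when the 4 bytes starting at addr fall on two different pages. */
static bool straddles_page(uintptr_t addr)
{
    uintptr_t first = addr / PAGE_SIZE_ASSUMED;
    uintptr_t last = (addr + sizeof(uint32_t) - 1) / PAGE_SIZE_ASSUMED;

    return first != last;
}
```

Since the page size is a multiple of 4, an aligned 32-bit value always ends
on the page where it starts. For example, address 4092 is aligned and stays
within the first page, while address 4094 is unaligned and spills into the
second.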

When page faults are encountered, they are handled asynchronously via a
workqueue. If the page faults back in, the write update is attempted again.
If the page cannot be faulted in, then we log and wait until the next time
the event is enabled/disabled. This prevents possible infinite loops
resulting from bad user processes unmapping or changing protection values
after registering the address.

NOTE:
User programs that wish to have the enable bit shared across forks
either need to use a MAP_SHARED allocated address or register a new
address and file descriptor. If MAP_SHARED cannot be used or new
registrations cannot be done, then it's allowable to use MAP_PRIVATE
as long as the forked children never update the page themselves. Once
a child updates the page, the page from the parent will be copied over
to the child. This new copy-on-write page will not receive updates from
the kernel until another registration has been performed with this new
address.
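
The fork behavior in the note can be observed from user space without
user_events at all. The sketch below is only an illustration of MAP_SHARED
versus MAP_PRIVATE semantics; the function name is hypothetical and nothing
here is registered with the kernel:

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map one anonymous page with the given sharing flag, fork, let the child
 * write 1 into it, and return what the parent reads afterwards.
 * Returns -1 on error. */
static int parent_view_after_child_write(int sharing_flag)
{
    uint32_t *word = mmap(NULL, sizeof(*word), PROT_READ | PROT_WRITE,
                          sharing_flag | MAP_ANONYMOUS, -1, 0);
    if (word == MAP_FAILED)
        return -1;
    *word = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(word, sizeof(*word));
        return -1;
    }
    if (pid == 0) {
        /* With MAP_PRIVATE this write triggers copy-on-write, so the
         * child ends up with its own copy of the page. */
        *word = 1;
        _exit(0);
    }

    int seen = -1;
    if (waitpid(pid, NULL, 0) == pid)
        seen = (int)*word;
    munmap(word, sizeof(*word));
    return seen;
}
```

With MAP_SHARED the parent reads 1, because both processes still use the
same page. With MAP_PRIVATE the parent reads 0, because the child's write
landed on a separate copy-on-write page.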

Beau Belgrave (2):
  tracing/user_events: Use remote writes for event enablement
  tracing/user_events: Fixup enable faults asyncly

 include/linux/user_events.h      |  10 +-
 kernel/trace/trace_events_user.c | 396 ++++++++++++++++++++-----------
 2 files changed, 270 insertions(+), 136 deletions(-)


base-commit: 23758867219c8d84c8363316e6dd2f9fd7ae3049

Comments

Mathieu Desnoyers Oct. 28, 2022, 9:50 p.m. UTC | #1
On 2022-10-27 18:40, Beau Belgrave wrote:
> As part of the discussions for user_events aligned with user space
> tracers, it was determined that user programs should register a 32-bit
> value to set or clear a bit when an event becomes enabled. Currently a
> shared page is being used that requires mmap().
> 
> In this new model during the event registration from user programs 2 new
> values are specified. The first is the address to update when the event
> is either enabled or disabled. The second is the bit to set/clear to
> reflect the event being enabled. This allows for a local 32-bit value in
> user programs to support both kernel and user tracers. As an example,
> setting bit 31 for kernel tracers when the event becomes enabled allows
> for user tracers to use the other bits for ref counts or other flags.
> The kernel side updates the bit atomically, user programs need to also
> update these values atomically.

Nice!

> 
> User provided addresses must be aligned on a 32-bit boundary, this
> allows for single page checking and prevents odd behaviors such as a
> 32-bit value straddling 2 pages instead of a single page.
> 
> When page faults are encountered they are done asyncly via a workqueue.
> If the page faults back in, the write update is attempted again. If the
> page cannot fault-in, then we log and wait until the next time the event
> is enabled/disabled. This is to prevent possible infinite loops resulting
> from bad user processes unmapping or changing protection values after
> registering the address.

I'll have a close look at this workqueue page fault scheme, probably 
next week.

> 
> NOTE:
> User programs that wish to have the enable bit shared across forks
> either need to use a MAP_SHARED allocated address or register a new
> address and file descriptor. If MAP_SHARED cannot be used or new
> registrations cannot be done, then it's allowable to use MAP_PRIVATE
> as long as the forked children never update the page themselves. Once
> the page has been updated, the page from the parent will be copied over
> to the child. This new copy-on-write page will not receive updates from
> the kernel until another registration has been performed with this new
> address.

This seems rather odd. I would expect that if a parent process registers 
some instrumentation using private mappings for enabled state through 
the user events ioctl, and then forks, the child process would 
seamlessly be traced by the user events ABI while being able to also 
change the enabled state from the userspace tracer libraries (which 
would trigger COW). Requiring the child to re-register to user events is 
rather odd.

What is preventing us from tracing the child without re-registration in 
this scenario ?

Thanks,

Mathieu

Beau Belgrave Oct. 28, 2022, 10:17 p.m. UTC | #2
On Fri, Oct 28, 2022 at 05:50:04PM -0400, Mathieu Desnoyers wrote:
> On 2022-10-27 18:40, Beau Belgrave wrote:
> > As part of the discussions for user_events aligned with user space
> > tracers, it was determined that user programs should register a 32-bit
> > value to set or clear a bit when an event becomes enabled. Currently a
> > shared page is being used that requires mmap().
> > 
> > In this new model during the event registration from user programs 2 new
> > values are specified. The first is the address to update when the event
> > is either enabled or disabled. The second is the bit to set/clear to
> > reflect the event being enabled. This allows for a local 32-bit value in
> > user programs to support both kernel and user tracers. As an example,
> > setting bit 31 for kernel tracers when the event becomes enabled allows
> > for user tracers to use the other bits for ref counts or other flags.
> > The kernel side updates the bit atomically, user programs need to also
> > update these values atomically.
> 
> Nice!
> 
> > 
> > User provided addresses must be aligned on a 32-bit boundary, this
> > allows for single page checking and prevents odd behaviors such as a
> > 32-bit value straddling 2 pages instead of a single page.
> > 
> > When page faults are encountered they are done asyncly via a workqueue.
> > If the page faults back in, the write update is attempted again. If the
> > page cannot fault-in, then we log and wait until the next time the event
> > is enabled/disabled. This is to prevent possible infinite loops resulting
> > from bad user processes unmapping or changing protection values after
> > registering the address.
> 
> I'll have a close look at this workqueue page fault scheme, probably next
> week.
> 

Excellent.

> > 
> > NOTE:
> > User programs that wish to have the enable bit shared across forks
> > either need to use a MAP_SHARED allocated address or register a new
> > address and file descriptor. If MAP_SHARED cannot be used or new
> > registrations cannot be done, then it's allowable to use MAP_PRIVATE
> > as long as the forked children never update the page themselves. Once
> > the page has been updated, the page from the parent will be copied over
> > to the child. This new copy-on-write page will not receive updates from
> > the kernel until another registration has been performed with this new
> > address.
> 
> This seems rather odd. I would expect that if a parent process registers
> some instrumentation using private mappings for enabled state through the
> user events ioctl, and then forks, the child process would seamlessly be
> traced by the user events ABI while being able to also change the enabled
> state from the userspace tracer libraries (which would trigger COW).
> Requiring the child to re-register to user events is rather odd.
> 

It's the COW that is the problem, see below.

> What is preventing us from tracing the child without re-registration in this
> scenario ?
> 

Largely knowing when the COW occurs on a specific page. We don't make
the mappings, so I'm unsure if we can ask to be notified easily during
these times or not. If we could, that would solve this. I'm glad you are
thinking about this. The note here was exactly to trigger this
discussion :)

I believe this is the same as a Futex, I'll take another look at that
code to see if they've come up with anything regarding this.

Any ideas?

Thanks,
-Beau
Mathieu Desnoyers Oct. 29, 2022, 1:58 p.m. UTC | #3
On 2022-10-28 18:17, Beau Belgrave wrote:
> On Fri, Oct 28, 2022 at 05:50:04PM -0400, Mathieu Desnoyers wrote:
>> On 2022-10-27 18:40, Beau Belgrave wrote:

[...]
> 
>>>
>>> NOTE:
>>> User programs that wish to have the enable bit shared across forks
>>> either need to use a MAP_SHARED allocated address or register a new
>>> address and file descriptor. If MAP_SHARED cannot be used or new
>>> registrations cannot be done, then it's allowable to use MAP_PRIVATE
>>> as long as the forked children never update the page themselves. Once
>>> the page has been updated, the page from the parent will be copied over
>>> to the child. This new copy-on-write page will not receive updates from
>>> the kernel until another registration has been performed with this new
>>> address.
>>
>> This seems rather odd. I would expect that if a parent process registers
>> some instrumentation using private mappings for enabled state through the
>> user events ioctl, and then forks, the child process would seamlessly be
>> traced by the user events ABI while being able to also change the enabled
>> state from the userspace tracer libraries (which would trigger COW).
>> Requiring the child to re-register to user events is rather odd.
>>
> 
> It's the COW that is the problem, see below.
> 
>> What is preventing us from tracing the child without re-registration in this
>> scenario ?
>>
> 
> Largely knowing when the COW occurs on a specific page. We don't make
> the mappings, so I'm unsure if we can ask to be notified easily during
> these times or not. If we could, that would solve this. I'm glad you are
> thinking about this. The note here was exactly to trigger this
> discussion :)
> 
> I believe this is the same as a Futex, I'll take another look at that
> code to see if they've come up with anything regarding this.
> 
> Any ideas?

Based on your description of the symptoms, AFAIU, upon registration of a 
given user event associated with a mm_struct, the user events ioctl 
appears to translate the virtual address into a page pointer 
immediately, and keeps track of that page afterwards. This means it 
loses track of the page when COW occurs.

Why not keep track of the registered virtual address and mm_struct 
associated with the event rather than the page ? Whenever a state change 
is needed, the virtual-address-to-page translation will be performed 
again. If it follows a COW, it will get the new copied page. If it 
happens that no COW was done, it should map to the original page. If the 
mapping is shared, the kernel would update that shared page. If the 
mapping is private, then the kernel would COW the page before updating it.

Thoughts ?

Thanks,

Mathieu

Masami Hiramatsu (Google) Oct. 31, 2022, 2:15 p.m. UTC | #4
Hi Beau,

On Thu, 27 Oct 2022 15:40:09 -0700
Beau Belgrave <beaub@linux.microsoft.com> wrote:

> As part of the discussions for user_events aligned with user space
> tracers, it was determined that user programs should register a 32-bit
> value to set or clear a bit when an event becomes enabled. Currently a
> shared page is being used that requires mmap().
> 
> In this new model during the event registration from user programs 2 new
> values are specified. The first is the address to update when the event
> is either enabled or disabled. The second is the bit to set/clear to
> reflect the event being enabled. This allows for a local 32-bit value in
> user programs to support both kernel and user tracers. As an example,
> setting bit 31 for kernel tracers when the event becomes enabled allows
> for user tracers to use the other bits for ref counts or other flags.
> The kernel side updates the bit atomically, user programs need to also
> update these values atomically.

I think you mean the kernel tracer (ftrace/perf) and user tracers (e.g.
LTTng) use the same 32-bit data so that the traced user application only
checks that data to see whether an event is enabled, right?

If so, how do the user tracer threads update the data bit? Is it
thread-safe to update from both the kernel tracer and user tracers at the
same time?

And what is the actual advantage of this change? Are there any issues
with using the mmap'd page? I would like to know more about the background
of this change.

Could you also provide a sample program that I can play with? :)

Mathieu Desnoyers Oct. 31, 2022, 3:27 p.m. UTC | #5
On 2022-10-31 10:15, Masami Hiramatsu (Google) wrote:
> Hi Beau,
> 
> On Thu, 27 Oct 2022 15:40:09 -0700
> Beau Belgrave <beaub@linux.microsoft.com> wrote:
> 
>> As part of the discussions for user_events aligned with user space
>> tracers, it was determined that user programs should register a 32-bit
>> value to set or clear a bit when an event becomes enabled. Currently a
>> shared page is being used that requires mmap().
>>
>> In this new model during the event registration from user programs 2 new
>> values are specified. The first is the address to update when the event
>> is either enabled or disabled. The second is the bit to set/clear to
>> reflect the event being enabled. This allows for a local 32-bit value in
>> user programs to support both kernel and user tracers. As an example,
>> setting bit 31 for kernel tracers when the event becomes enabled allows
>> for user tracers to use the other bits for ref counts or other flags.
>> The kernel side updates the bit atomically, user programs need to also
>> update these values atomically.
> 
> I think you mean the kernel tracer (ftrace/perf) and user tracers (e.g.
> LTTng) use the same 32-bit data so that the traced user application only
> checks that data to see whether an event is enabled, right?
> 
> If so, how do the user tracer threads update the data bit? Is it
> thread-safe to update from both the kernel tracer and user tracers at the
> same time?

Yes, my plan is to have userspace tracer agent threads use atomic 
increments/decrements to update the "enabled" state, and the kernel use 
atomic bit set/clear to update the top bit. This should allow the state 
to be updated concurrently without issues.

> 
> And what is the actual advantage of this change? Are there any issues
> with using the mmap'd page? I would like to know more about the background
> of this change.

With this change we can allow a user-space process to manage userspace 
tracing on its own, without any kernel support. Registering to user 
events becomes entirely optional. So if a kernel does not provide user 
events (either an old kernel or a kernel with CONFIG_USER_EVENTS=n), 
userspace tracing still works.

This also allows user-space tracers to co-exist with the user events ABI.

> 
> Could you also provide a sample program that I can play with? :)

I've been working on a user-space static instrumentation library in 
recent weeks. I've left "TODO" items for integration with user events 
ioctl/writev in the userspace code. See

https://github.com/compudj/side

There is now a build dependency on librseq to provide fast RCU read-side 
to iterate on the array of userspace tracer callbacks:

https://github.com/compudj/librseq

(this dependency could be made optional in the future)

I know Doug is working on his own private repository for userspace 
instrumentation, and we share a lot of common goals.

Thanks,

Mathieu

Beau Belgrave Oct. 31, 2022, 4:53 p.m. UTC | #6
On Sat, Oct 29, 2022 at 09:58:26AM -0400, Mathieu Desnoyers wrote:
> On 2022-10-28 18:17, Beau Belgrave wrote:
> > On Fri, Oct 28, 2022 at 05:50:04PM -0400, Mathieu Desnoyers wrote:
> > > On 2022-10-27 18:40, Beau Belgrave wrote:
> 
> [...]
> > 
> > > > 
> > > > NOTE:
> > > > User programs that wish to have the enable bit shared across forks
> > > > either need to use a MAP_SHARED allocated address or register a new
> > > > address and file descriptor. If MAP_SHARED cannot be used or new
> > > > registrations cannot be done, then it's allowable to use MAP_PRIVATE
> > > > as long as the forked children never update the page themselves. Once
> > > > the page has been updated, the page from the parent will be copied over
> > > > to the child. This new copy-on-write page will not receive updates from
> > > > the kernel until another registration has been performed with this new
> > > > address.
> > > 
> > > This seems rather odd. I would expect that if a parent process registers
> > > some instrumentation using private mappings for enabled state through the
> > > user events ioctl, and then forks, the child process would seamlessly be
> > > traced by the user events ABI while being able to also change the enabled
> > > state from the userspace tracer libraries (which would trigger COW).
> > > Requiring the child to re-register to user events is rather odd.
> > > 
> > 
> > It's the COW that is the problem, see below.
> > 
> > > What is preventing us from tracing the child without re-registration in this
> > > scenario ?
> > > 
> > 
> > Largely knowing when the COW occurs on a specific page. We don't make
> > the mappings, so I'm unsure if we can ask to be notified easily during
> > these times or not. If we could, that would solve this. I'm glad you are
> > thinking about this. The note here was exactly to trigger this
> > discussion :)
> > 
> > I believe this is the same as a Futex, I'll take another look at that
> > code to see if they've come up with anything regarding this.
> > 
> > Any ideas?
> 
> Based on your description of the symptoms, AFAIU, upon registration of a
> given user event associated with a mm_struct, the user events ioctl appears
> to translate the virtual address into a page pointer immediately, and keeps
> track of that page afterwards. This means it loses track of the page when
> COW occurs.
> 

No, we keep the memory descriptor and virtual address so we can properly
resolve to page per-process.

> Why not keep track of the registered virtual address and mm_struct
> associated with the event rather than the page ? Whenever a state change is
> needed, the virtual-address-to-page translation will be performed again. If
> it follows a COW, it will get the new copied page. If it happens that no COW
> was done, it should map to the original page. If the mapping is shared, the
> kernel would update that shared page. If the mapping is private, then the
> kernel would COW the page before updating it.
> 
> Thoughts ?
> 

I think you are forgetting about page table entries. My understanding is
the process will have the VMAs copied on fork, but the page table
entries will be marked read-only. Then when the write access occurs, the
COW is created (since the PTE says readonly, but the VMA says writable).
However, that COW page is now only mapped within that forked process
page table.

This requires tracking the child memory descriptors in addition to the
parent. The most straightforward way I see this happening is requiring
user side to mmap the user_event_data fd that is used for write. This
way when fork occurs in dup_mm() / dup_mmap() that mmap'd
user_event_data will get open() / close() called per-fork. I could then
copy the enablers from the parent but with the child's memory descriptor
to allow proper lookup.

This is like fork before COW, it's a bummer I cannot see a way to do
this per-page. Doing the above would work, but it requires copying all
the enablers, not just the one that changed after the fork.


Thanks,
-Beau
Beau Belgrave Oct. 31, 2022, 5:27 p.m. UTC | #7
On Mon, Oct 31, 2022 at 11:15:56PM +0900, Masami Hiramatsu wrote:
> Hi Beau,
> 
> On Thu, 27 Oct 2022 15:40:09 -0700
> Beau Belgrave <beaub@linux.microsoft.com> wrote:
> 
> > As part of the discussions for user_events aligned with user space
> > tracers, it was determined that user programs should register a 32-bit
> > value to set or clear a bit when an event becomes enabled. Currently a
> > shared page is being used that requires mmap().
> > 
> > In this new model during the event registration from user programs 2 new
> > values are specified. The first is the address to update when the event
> > is either enabled or disabled. The second is the bit to set/clear to
> > reflect the event being enabled. This allows for a local 32-bit value in
> > user programs to support both kernel and user tracers. As an example,
> > setting bit 31 for kernel tracers when the event becomes enabled allows
> > for user tracers to use the other bits for ref counts or other flags.
> > The kernel side updates the bit atomically, user programs need to also
> > update these values atomically.
> 
> I think you mean the kernel tracer (ftrace/perf) and user tracers (e.g.
> LTTng) use the same 32-bit data so that the traced user application only
> checks that data to see whether an event is enabled, right?
> 

Yes, exactly, user code can just check a single uint32 or uint64 to tell
if anything is enabled (kernel or user tracer).

> If so, how do the user tracer threads update the data bit? Is it
> thread-safe to update from both the kernel tracer and user tracers at the
> same time?
> 

This is why atomics are used to set the bit on the kernel side. The user
side should do the same. This is like the futex code. Do you see a
problem with atomics being used between user and kernel space on a
shared 32/64-bit address?

> And what is the actual advantage of this change? Are there any issues
> with using the mmap'd page? I would like to know more about the background
> of this change.
> 

Without this change user tracers like LTTng will have to check 2 values
instead of 1 to tell if the kernel tracer is enabled or not. Mathieu is
working on a user side tracing library in an effort to align writing
tracing code in user processes that works well for both kernel and user
tracers without much effort.

See here:
https://github.com/compudj/side

Are you proposing we keep the bitmap approach and have side library just
hook another branch? Mathieu had issues with that approach during our
talks.

> Could you also provide a sample program that I can play with? :)
> 

When I make the next patch version, I will update the user_events sample
so you'll have something to try out.


Thanks,
-Beau
Mathieu Desnoyers Oct. 31, 2022, 6:25 p.m. UTC | #8
On 2022-10-31 13:27, Beau Belgrave wrote:
> On Mon, Oct 31, 2022 at 11:15:56PM +0900, Masami Hiramatsu wrote:
[...]
>> And what is the actual advantage of this change? Are there any issues
>> with using the mmap'd page? I would like to know more about the background
>> of this change.
>>
> 
> Without this change user tracers like LTTng will have to check 2 values
> instead of 1 to tell if the kernel tracer is enabled or not. Mathieu is
> working on a user side tracing library in an effort to align writing
> tracing code in user processes that works well for both kernel and user
> tracers without much effort.
> 
> See here:
> https://github.com/compudj/side
> 
> Are you proposing we keep the bitmap approach and have side library just
> hook another branch? Mathieu had issues with that approach during our
> talks.

As overhead of the disabled tracepoints was a key factor in having the Linux
kernel adopt tracepoints when I created those back in 2008, I expect that having
minimal overhead in the disabled case will also prove to be a key factor for
adoption by user-space applications.

Another aspect that seems to be very important for wide adoption by user-space
is that the instrumentation library needs to have a license that is very
convenient for inclusion into statically linked software without additional
license requirements. This therefore excludes GPL and LGPL. I've used the MIT
license for the "side" project for that purpose.

Indeed, my ideal scenario is to use asm goto and implement something similar
to jump labels in user-space so the instrumentation only costs a no-op or a
jump when instrumentation is disabled. That can only be used in contexts where
code patching is allowed though (not for Runtime Integrity Checked (RIC) processes).

My next-to-best scenario is to have a single load (from fixed offset), test and
conditional branch in the userspace fast-path instead. This approach will need
to be made available as a fall-back for processes which are flagged as RIC-protected.

I currently focus my efforts on the load+test+conditional branch scheme, which is
somewhat simpler than the code patching approach in terms of needed infrastructure.
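
For illustration, the load+test+conditional branch fast path could look
roughly like the sketch below. This is not code from the "side" library: the
struct, the function names, and the hit counter are hypothetical, and
__builtin_expect assumes GCC or Clang.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-event state: one word checked by the fast path. */
struct event_state {
    _Atomic uint32_t enabled;   /* non-zero when any tracer is attached */
    unsigned long hits;         /* stand-in for real event output */
};

/* Slow path stand-in: a real tracer would serialize the payload here. */
static void emit_event(struct event_state *ev)
{
    ev->hits++;
}

/* Fast path: one relaxed load, one test, one normally not-taken branch. */
static inline void tracepoint(struct event_state *ev)
{
    uint32_t v = atomic_load_explicit(&ev->enabled, memory_order_relaxed);

    if (__builtin_expect(v != 0, 0))
        emit_event(ev);
}
```

When the word is zero, the instrumented code pays only for the load and the
untaken branch, which is the disabled-case overhead discussed above.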

If we go for the current user events bitmap approach, then anything we do from
userspace will have more overhead (additional pointer chasing, loads, and masks
to apply). And it pretty much rules out code patching.

In terms of missing pieces to allow code patching to be done in userspace, here
is what I think we'd need:

- Extend the "side" (MIT-licensed) library to implement gadgets which support
code patching, but fall-back to load+test+conditional branch if code patching
is not available. Roughly, those would look like (this is really just pseudo-code):

.pushsection side_jmp
/*
  * &side_enabled_value is the key used to change the enabled/disabled state.
  * 1f is the address of the code to patch.
  * 3f is the address of branch target when disabled.
  * 4f is the address of branch target when enabled.
  */
.quad &side_enabled_value, 1f, 3f, 4f
.popsection

/*
  * Place all jump instructions that will be modified by code patching into a
  * single section. Therefore, this will minimize the amount of COW required when
  * patching code from executables and shared libraries that have instances in
  * many processes.
  */
.pushsection side_jmp_modify_code (executable section)
1:
jump to 2f
.popsection

jump to 1b
2:
load side_enabled_value
test
cond. branch to 4f
3:
-> disabled
4:
-> enabled

When loading the .so or the executable, the initial state uses the load,
test, conditional branch. Then in a constructor, if code patching is available,
the jump at label (1) can be updated to target (3) instead. Then when enabled,
it can be updated to target (4) instead.

- Implement a code patching system call in the kernel which takes care of all the
details associated with code patching that supports concurrent execution (breakpoint
bypass, or stopping target processes if required by the architecture). This system
call could check whether the target process has Runtime Integrity Check enforced,
and refuse code patching as needed.

As a nice side-effect, this could allow us to implement things like "alternative"
assembler instruction selection in user-space.

- Figure out a way for a user-space process to let the kernel know that it needs
to enforce Runtime Integrity Check. It could be either a prctl(), or perhaps a
clone flag if this needs to be known very early in the process lifetime.
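
For reference, the load+test+conditional branch fall-back described earlier in
this message could look roughly like the following in C. This is only a sketch:
side_enabled_value is borrowed from the pseudo-code, while tracepoint_site(),
emit_event_slow_path() and slow_path_hits are hypothetical names made up for
illustration, and __builtin_expect assumes GCC or Clang.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-event word: nonzero means at least one tracer is attached. */
static _Atomic uint32_t side_enabled_value;

/* Counts slow-path entries so the sketch has an observable effect. */
static unsigned int slow_path_hits;

/* Stand-in for the code that would serialize and emit the event. */
static void emit_event_slow_path(void)
{
	slow_path_hits++;
}

/* Fast path: one load, one test, one conditional branch. */
static inline void tracepoint_site(void)
{
	uint32_t enabled = atomic_load_explicit(&side_enabled_value,
						memory_order_relaxed);

	/* Hint to the compiler that the disabled case is the common one. */
	if (__builtin_expect(enabled != 0, 0))
		emit_event_slow_path();
}
```

With the word at zero, the call site falls straight through. Storing any nonzero
value (for instance bit 31) makes the next call take the slow path.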

Thanks,

Mathieu
Masami Hiramatsu (Google) Nov. 1, 2022, 1:52 p.m. UTC | #9
On Mon, 31 Oct 2022 10:27:06 -0700
Beau Belgrave <beaub@linux.microsoft.com> wrote:

> On Mon, Oct 31, 2022 at 11:15:56PM +0900, Masami Hiramatsu wrote:
> > Hi Beau,
> > 
> > On Thu, 27 Oct 2022 15:40:09 -0700
> > Beau Belgrave <beaub@linux.microsoft.com> wrote:
> > 
> > > As part of the discussions for user_events aligned with user space
> > > tracers, it was determined that user programs should register a 32-bit
> > > value to set or clear a bit when an event becomes enabled. Currently a
> > > shared page is being used that requires mmap().
> > > 
> > > In this new model during the event registration from user programs 2 new
> > > values are specified. The first is the address to update when the event
> > > is either enabled or disabled. The second is the bit to set/clear to
> > > reflect the event being enabled. This allows for a local 32-bit value in
> > > user programs to support both kernel and user tracers. As an example,
> > > setting bit 31 for kernel tracers when the event becomes enabled allows
> > > for user tracers to use the other bits for ref counts or other flags.
> > > The kernel side updates the bit atomically, user programs need to also
> > > update these values atomically.
> > 
> > I think you means the kernel tracer (ftrace/perf) and user tracers (e.g. 
> > LTTng) use the same 32bit data so that traced user-application only checks
> > that data for checking an event is enabled, right?
> > 
> 
> Yes, exactly, user code can just check a single uint32 or uint64 to tell
> if anything is enabled (kernel or user tracer).
> 
> > If so, who the user tracer threads updates the data bit? Is that thread
> > safe to update both kernel tracer and user tracers at the same time?
> > 
> 
> This is why atomics are used to set the bit on the kernel side. The user
> side should do the same. This is like the futex code. Do you see a
> problem with atomics being used between user and kernel space on a
> shared 32/64-bit address?

Ah, OK. set_bit()/clear_bit() are atomic ops. So the user tracer must
use a per-arch atomic ops implementation too. Hmm, can you add a comment about that there?
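
To make the sharing concrete, here is a minimal user-space sketch, assuming the
layout from the cover letter (bit 31 for the kernel tracer, the remaining bits
used as a user tracer ref count). All names are hypothetical, C11 atomics stand
in for whatever per-arch atomic ops a real tracer would use, and
simulate_kernel_enable() only mimics the kernel's set_bit()/clear_bit() from
user space.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define KERNEL_ENABLE_MASK (UINT32_C(1) << 31)	/* bit owned by the kernel tracer */
#define USER_REFCOUNT_MASK (~KERNEL_ENABLE_MASK)	/* low 31 bits: user ref count */

/* The single 32-bit word the instrumented code checks. */
static _Atomic uint32_t event_enabled;

/* User tracer attaches: atomically bump the ref count in the low bits. */
static void user_tracer_attach(void)
{
	atomic_fetch_add(&event_enabled, 1);
}

/* User tracer detaches: atomically drop the ref count. */
static void user_tracer_detach(void)
{
	uint32_t old = atomic_fetch_sub(&event_enabled, 1);

	/* Detaching without a prior attach would borrow from bit 31. */
	assert((old & USER_REFCOUNT_MASK) != 0);
}

/* Stand-in for the kernel side setting or clearing its bit. */
static void simulate_kernel_enable(bool on)
{
	if (on)
		atomic_fetch_or(&event_enabled, KERNEL_ENABLE_MASK);
	else
		atomic_fetch_and(&event_enabled, USER_REFCOUNT_MASK);
}

/* The traced application only needs to know whether anything is enabled. */
static bool event_is_enabled(void)
{
	return atomic_load(&event_enabled) != 0;
}
```

Because every update is a read-modify-write on the whole word, a user tracer
changing the ref count cannot lose a concurrent update to bit 31, and vice versa.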

> 
> > And what is the actual advantage of this change? Are there any issue
> > to use mmaped page? I would like to know more background of this
> > change.
> > 
> 
> Without this change user tracers like LTTng will have to check 2 values
> instead of 1 to tell if the kernel tracer is enabled or not. Mathieu is
> working on a user side tracing library in an effort to align writing
> tracing code in user processes that works well for both kernel and user
> tracers without much effort.
> 
> See here:
> https://github.com/compudj/side

Thanks for pointing!

> 
> Are you proposing we keep the bitmap approach and have side library just
> hook another branch? Mathieu had issues with that approach during our
> talks.

No, that makes things more complicated. We should choose one.

> 
> > Could you also provide any sample program which I can play it? :)
> > 
> 
> When I make the next patch version, I will update the user_events sample
> so you'll have something to try out.

That's helpful for me. We can have the code under tools/tracing/user_events/.

Thank you,

> 
> > > User provided addresses must be aligned on a 32-bit boundary, this
> > > allows for single page checking and prevents odd behaviors such as a
> > > 32-bit value straddling 2 pages instead of a single page.
> > > 
> > > When page faults are encountered they are done asyncly via a workqueue.
> > > If the page faults back in, the write update is attempted again. If the
> > > page cannot fault-in, then we log and wait until the next time the event
> > > is enabled/disabled. This is to prevent possible infinite loops resulting
> > > from bad user processes unmapping or changing protection values after
> > > registering the address.
> > > 
> > > NOTE:
> > > User programs that wish to have the enable bit shared across forks
> > > either need to use a MAP_SHARED allocated address or register a new
> > > address and file descriptor. If MAP_SHARED cannot be used or new
> > > registrations cannot be done, then it's allowable to use MAP_PRIVATE
> > > as long as the forked children never update the page themselves. Once
> > > the page has been updated, the page from the parent will be copied over
> > > to the child. This new copy-on-write page will not receive updates from
> > > the kernel until another registration has been performed with this new
> > > address.
> > > 
> > > Beau Belgrave (2):
> > >   tracing/user_events: Use remote writes for event enablement
> > >   tracing/user_events: Fixup enable faults asyncly
> > > 
> > >  include/linux/user_events.h      |  10 +-
> > >  kernel/trace/trace_events_user.c | 396 ++++++++++++++++++++-----------
> > >  2 files changed, 270 insertions(+), 136 deletions(-)
> > > 
> > > 
> > > base-commit: 23758867219c8d84c8363316e6dd2f9fd7ae3049
> > > -- 
> > > 2.25.1
> > > 
> > 
> > 
> > -- 
> > Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> Thanks,
> -Beau
Beau Belgrave Nov. 1, 2022, 4:55 p.m. UTC | #10
On Tue, Nov 01, 2022 at 10:52:20PM +0900, Masami Hiramatsu wrote:
> On Mon, 31 Oct 2022 10:27:06 -0700
> Beau Belgrave <beaub@linux.microsoft.com> wrote:
> 
> > On Mon, Oct 31, 2022 at 11:15:56PM +0900, Masami Hiramatsu wrote:
> > > Hi Beau,
> > > 
> > > On Thu, 27 Oct 2022 15:40:09 -0700
> > > Beau Belgrave <beaub@linux.microsoft.com> wrote:
> > > 
> > > > As part of the discussions for user_events aligned with user space
> > > > tracers, it was determined that user programs should register a 32-bit
> > > > value to set or clear a bit when an event becomes enabled. Currently a
> > > > shared page is being used that requires mmap().
> > > > 
> > > > In this new model during the event registration from user programs 2 new
> > > > values are specified. The first is the address to update when the event
> > > > is either enabled or disabled. The second is the bit to set/clear to
> > > > reflect the event being enabled. This allows for a local 32-bit value in
> > > > user programs to support both kernel and user tracers. As an example,
> > > > setting bit 31 for kernel tracers when the event becomes enabled allows
> > > > for user tracers to use the other bits for ref counts or other flags.
> > > > The kernel side updates the bit atomically, user programs need to also
> > > > update these values atomically.
> > > 
> > > I think you means the kernel tracer (ftrace/perf) and user tracers (e.g. 
> > > LTTng) use the same 32bit data so that traced user-application only checks
> > > that data for checking an event is enabled, right?
> > > 
> > 
> > Yes, exactly, user code can just check a single uint32 or uint64 to tell
> > if anything is enabled (kernel or user tracer).
> > 
> > > If so, who the user tracer threads updates the data bit? Is that thread
> > > safe to update both kernel tracer and user tracers at the same time?
> > > 
> > 
> > This is why atomics are used to set the bit on the kernel side. The user
> > side should do the same. This is like the futex code. Do you see a
> > problem with atomics being used between user and kernel space on a
> > shared 32/64-bit address?
> 
> Ah, OK. set_bit()/clear_bit() are atomic ops. So the user tracer must
> use a per-arch atomic ops implementation too. Hmm, can you add a comment about that there?
> 

I can add a comment here, and I also plan to update our documentation. I
really want to get good feedback on this first, so I avoid updating the
documentation several times as we progress this conversation. Expect
documentation updates when I flip from RFC to a normal patchset.

> > 
> > > And what is the actual advantage of this change? Are there any issue
> > > to use mmaped page? I would like to know more background of this
> > > change.
> > > 
> > 
> > Without this change user tracers like LTTng will have to check 2 values
> > instead of 1 to tell if the kernel tracer is enabled or not. Mathieu is
> > working on a user side tracing library in an effort to align writing
> > tracing code in user processes that works well for both kernel and user
> > tracers without much effort.
> > 
> > See here:
> > https://github.com/compudj/side
> 
> Thanks for pointing!
> 
> > 
> > Are you proposing we keep the bitmap approach and have side library just
> > hook another branch? Mathieu had issues with that approach during our
> > talks.
> 
> No, that makes things more complicated. We should choose one.
> 

Agreed, it seems we are settling on the user-provided address
approach, as long as we can work through fork() and other scenarios.
During the bi-weekly tracefs meetings we've been going back and forth
on which approach to take.

I promised an RFC patch to see how far I could get on this and what
edge cases exist that we need to work through. Currently fork() seems
the hardest to do with private mappings, but I believe I have a path
forward that I'll put in the next version of this patchset.
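
To illustrate why private mappings are the hard case, here is a small
user-space sketch. It does not involve user_events at all: it only shows that
after fork(), a write to a MAP_PRIVATE page lands in a copy visible to the
writer alone, while a MAP_SHARED page stays common to both processes. The
function name is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map one anonymous page with the given flags (MAP_PRIVATE or MAP_SHARED),
 * fork, let the child write 1 to the first word, and return what the parent
 * reads afterwards. Returns -1 on error.
 */
static int parent_view_after_child_write(int map_flags)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	volatile uint32_t *word;
	int status, seen;
	pid_t pid;

	word = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    map_flags | MAP_ANONYMOUS, -1, 0);
	if (word == MAP_FAILED)
		return -1;

	*word = 0;
	pid = fork();
	if (pid < 0) {
		munmap((void *)word, len);
		return -1;
	}
	if (pid == 0) {
		/* Child: with MAP_PRIVATE this write triggers COW. */
		*word = 1;
		_exit(0);
	}
	if (waitpid(pid, &status, 0) < 0) {
		munmap((void *)word, len);
		return -1;
	}

	seen = (int)*word;
	munmap((void *)word, len);
	return seen;
}
```

The parent reads 0 with MAP_PRIVATE and 1 with MAP_SHARED. The same divergence
applies to a write made on behalf of only one of the two memory descriptors.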

> > 
> > > Could you also provide any sample program which I can play it? :)
> > > 
> > 
> > When I make the next patch version, I will update the user_events sample
> > so you'll have something to try out.
> 
> That's helpful for me. We can have the code under tools/tracing/user_events/.
> 

I was planning to update the existing sample at samples/user_events/.
Any reason that location can't be used?

> Thank you,
> 
> > 
> > > > User provided addresses must be aligned on a 32-bit boundary, this
> > > > allows for single page checking and prevents odd behaviors such as a
> > > > 32-bit value straddling 2 pages instead of a single page.
> > > > 
> > > > When page faults are encountered they are done asyncly via a workqueue.
> > > > If the page faults back in, the write update is attempted again. If the
> > > > page cannot fault-in, then we log and wait until the next time the event
> > > > is enabled/disabled. This is to prevent possible infinite loops resulting
> > > > from bad user processes unmapping or changing protection values after
> > > > registering the address.
> > > > 
> > > > NOTE:
> > > > User programs that wish to have the enable bit shared across forks
> > > > either need to use a MAP_SHARED allocated address or register a new
> > > > address and file descriptor. If MAP_SHARED cannot be used or new
> > > > registrations cannot be done, then it's allowable to use MAP_PRIVATE
> > > > as long as the forked children never update the page themselves. Once
> > > > the page has been updated, the page from the parent will be copied over
> > > > to the child. This new copy-on-write page will not receive updates from
> > > > the kernel until another registration has been performed with this new
> > > > address.
> > > > 
> > > > Beau Belgrave (2):
> > > >   tracing/user_events: Use remote writes for event enablement
> > > >   tracing/user_events: Fixup enable faults asyncly
> > > > 
> > > >  include/linux/user_events.h      |  10 +-
> > > >  kernel/trace/trace_events_user.c | 396 ++++++++++++++++++++-----------
> > > >  2 files changed, 270 insertions(+), 136 deletions(-)
> > > > 
> > > > 
> > > > base-commit: 23758867219c8d84c8363316e6dd2f9fd7ae3049
> > > > -- 
> > > > 2.25.1
> > > > 
> > > 
> > > 
> > > -- 
> > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > 
> > Thanks,
> > -Beau
> 
> 
> -- 
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thanks,
-Beau
Mathieu Desnoyers Nov. 2, 2022, 1:46 p.m. UTC | #11
On 2022-10-31 12:53, Beau Belgrave wrote:
> On Sat, Oct 29, 2022 at 09:58:26AM -0400, Mathieu Desnoyers wrote:
>> On 2022-10-28 18:17, Beau Belgrave wrote:
>>> On Fri, Oct 28, 2022 at 05:50:04PM -0400, Mathieu Desnoyers wrote:
>>>> On 2022-10-27 18:40, Beau Belgrave wrote:
>>
>> [...]
>>>
>>>>>
>>>>> NOTE:
>>>>> User programs that wish to have the enable bit shared across forks
>>>>> either need to use a MAP_SHARED allocated address or register a new
>>>>> address and file descriptor. If MAP_SHARED cannot be used or new
>>>>> registrations cannot be done, then it's allowable to use MAP_PRIVATE
>>>>> as long as the forked children never update the page themselves. Once
>>>>> the page has been updated, the page from the parent will be copied over
>>>>> to the child. This new copy-on-write page will not receive updates from
>>>>> the kernel until another registration has been performed with this new
>>>>> address.
>>>>
>>>> This seems rather odd. I would expect that if a parent process registers
>>>> some instrumentation using private mappings for enabled state through the
>>>> user events ioctl, and then forks, the child process would seamlessly be
>>>> traced by the user events ABI while being able to also change the enabled
>>>> state from the userspace tracer libraries (which would trigger COW).
>>>> Requiring the child to re-register to user events is rather odd.
>>>>
>>>
>>> It's the COW that is the problem, see below.
>>>
>>>> What is preventing us from tracing the child without re-registration in this
>>>> scenario ?
>>>>
>>>
>>> Largely knowing when the COW occurs on a specific page. We don't make
>>> the mappings, so I'm unsure if we can ask to be notified easily during
>>> these times or not. If we could, that would solve this. I'm glad you are
>>> thinking about this. The note here was exactly to trigger this
>>> discussion :)
>>>
>>> I believe this is the same as a Futex, I'll take another look at that
>>> code to see if they've come up with anything regarding this.
>>>
>>> Any ideas?
>>
>> Based on your description of the symptoms, AFAIU, upon registration of a
>> given user event associated with a mm_struct, the user events ioctl appears
>> to translates the virtual address into a page pointer immediately, and keeps
>> track of that page afterwards. This means it loses track of the page when
>> COW occurs.
>>
> 
> No, we keep the memory descriptor and virtual address so we can properly
> resolve to page per-process.
> 
>> Why not keep track of the registered virtual address and struct_mm
>> associated with the event rather than the page ? Whenever a state change is
>> needed, the virtual-address-to-page translation will be performed again. If
>> it follows a COW, it will get the new copied page. If it happens that no COW
>> was done, it should map to the original page. If the mapping is shared, the
>> kernel would update that shared page. If the mapping is private, then the
>> kernel would COW the page before updating it.
>>
>> Thoughts ?
>>
> 
> I think you are forgetting about page table entries. My understanding is
> the process will have the VMAs copied on fork, but the page table
> entries will be marked read-only. Then when the write access occurs, the
> COW is created (since the PTE says readonly, but the VMA says writable).
> However, that COW page is now only mapped within that forked process
> page table.
> 
> This requires tracking the child memory descriptors in addition to the
> parent. The most straightforward way I see this happening is requiring
> user side to mmap the user_event_data fd that is used for write. This
> way when fork occurs in dup_mm() / dup_mmap() that mmap'd
> user_event_data will get open() / close() called per-fork. I could then
> copy the enablers from the parent but with the child's memory descriptor
> to allow proper lookup.
> 
> This is like fork before COW, it's a bummer I cannot see a way to do
> this per-page. Doing the above would work, but it requires copying all
> the enablers, not just the one that changed after the fork.

This brings an overall design concern I have with user-events: AFAIU, 
the lifetime of the user event registration appears to be linked to the 
lifetime of a file descriptor.

What happens when that file descriptor is duplicated and sent over to 
another process through unix sockets credentials ? Does it mean that the 
kernel has a handle on the wrong process to update the "enabled" state?

Also, what happens on execve system call if the file descriptor 
representing the user event is not marked as close-on-exec ? Does it 
mean the kernel can corrupt user-space memory of the after-exec loaded 
binary when it attempts to update the "enabled" state ?

If I get this right, I suspect we might want to move the lifetime of the 
user event registration to the memory space (mm_struct).

Thanks,

Mathieu
Beau Belgrave Nov. 2, 2022, 5:18 p.m. UTC | #12
On Wed, Nov 02, 2022 at 09:46:31AM -0400, Mathieu Desnoyers wrote:
> On 2022-10-31 12:53, Beau Belgrave wrote:
> > On Sat, Oct 29, 2022 at 09:58:26AM -0400, Mathieu Desnoyers wrote:
> > > On 2022-10-28 18:17, Beau Belgrave wrote:
> > > > On Fri, Oct 28, 2022 at 05:50:04PM -0400, Mathieu Desnoyers wrote:
> > > > > On 2022-10-27 18:40, Beau Belgrave wrote:
> > > 
> > > [...]
> > > > 
> > > > > > 
> > > > > > NOTE:
> > > > > > User programs that wish to have the enable bit shared across forks
> > > > > > either need to use a MAP_SHARED allocated address or register a new
> > > > > > address and file descriptor. If MAP_SHARED cannot be used or new
> > > > > > registrations cannot be done, then it's allowable to use MAP_PRIVATE
> > > > > > as long as the forked children never update the page themselves. Once
> > > > > > the page has been updated, the page from the parent will be copied over
> > > > > > to the child. This new copy-on-write page will not receive updates from
> > > > > > the kernel until another registration has been performed with this new
> > > > > > address.
> > > > > 
> > > > > This seems rather odd. I would expect that if a parent process registers
> > > > > some instrumentation using private mappings for enabled state through the
> > > > > user events ioctl, and then forks, the child process would seamlessly be
> > > > > traced by the user events ABI while being able to also change the enabled
> > > > > state from the userspace tracer libraries (which would trigger COW).
> > > > > Requiring the child to re-register to user events is rather odd.
> > > > > 
> > > > 
> > > > It's the COW that is the problem, see below.
> > > > 
> > > > > What is preventing us from tracing the child without re-registration in this
> > > > > scenario ?
> > > > > 
> > > > 
> > > > Largely knowing when the COW occurs on a specific page. We don't make
> > > > the mappings, so I'm unsure if we can ask to be notified easily during
> > > > these times or not. If we could, that would solve this. I'm glad you are
> > > > thinking about this. The note here was exactly to trigger this
> > > > discussion :)
> > > > 
> > > > I believe this is the same as a Futex, I'll take another look at that
> > > > code to see if they've come up with anything regarding this.
> > > > 
> > > > Any ideas?
> > > 
> > > Based on your description of the symptoms, AFAIU, upon registration of a
> > > given user event associated with a mm_struct, the user events ioctl appears
> > > to translates the virtual address into a page pointer immediately, and keeps
> > > track of that page afterwards. This means it loses track of the page when
> > > COW occurs.
> > > 
> > 
> > No, we keep the memory descriptor and virtual address so we can properly
> > resolve to page per-process.
> > 
> > > Why not keep track of the registered virtual address and struct_mm
> > > associated with the event rather than the page ? Whenever a state change is
> > > needed, the virtual-address-to-page translation will be performed again. If
> > > it follows a COW, it will get the new copied page. If it happens that no COW
> > > was done, it should map to the original page. If the mapping is shared, the
> > > kernel would update that shared page. If the mapping is private, then the
> > > kernel would COW the page before updating it.
> > > 
> > > Thoughts ?
> > > 
> > 
> > I think you are forgetting about page table entries. My understanding is
> > the process will have the VMAs copied on fork, but the page table
> > entries will be marked read-only. Then when the write access occurs, the
> > COW is created (since the PTE says readonly, but the VMA says writable).
> > However, that COW page is now only mapped within that forked process
> > page table.
> > 
> > This requires tracking the child memory descriptors in addition to the
> > parent. The most straightforward way I see this happening is requiring
> > user side to mmap the user_event_data fd that is used for write. This
> > way when fork occurs in dup_mm() / dup_mmap() that mmap'd
> > user_event_data will get open() / close() called per-fork. I could then
> > copy the enablers from the parent but with the child's memory descriptor
> > to allow proper lookup.
> > 
> > This is like fork before COW, it's a bummer I cannot see a way to do
> > this per-page. Doing the above would work, but it requires copying all
> > the enablers, not just the one that changed after the fork.
> 
> This brings an overall design concern I have with user-events: AFAIU, the
> lifetime of the user event registration appears to be linked to the lifetime
> of a file descriptor.
> 

The lifetime of the user_event is linked to the lifetime of the
tracepoint. The tracepoint stays alive until someone explicitly
tries to delete it via the del IOCTL.

If the delete is attempted and there are references out to that
user_event, either via perf/ftrace or any open files or mm_structs, it
will not be allowed to go away.

The user_event does not go away automatically upon file release (last
close). However, when that file goes away, obviously the caller no
longer can write to it. This is why there are user_events within the
group, and then there are per-file user_event_refs. It allows for
tracking these lifetimes and writes in isolation.

> What happens when that file descriptor is duplicated and sent over to
> another process through unix sockets credentials ? Does it mean that the
> kernel has a handle on the wrong process to update the "enabled" state?
> 

You'll have to expand upon this more. If the FD is duplicated and
installed into another process, then the "enabled" state is still at
whatever mm_struct registered it. If that new process wants to have an
enabled state, it must register in its own process if it wasn't from a
fork. The mm_struct can only change pages in that process; it cannot
jump across to another process. This is why the fork case needs an
enabler clone with a new mm_struct, and why if it gets duplicated in an
odd way into another process, that process must register its own
enabler states.

> Also, what happens on execve system call if the file descriptor representing
> the user event is not marked as close-on-exec ? Does it mean the kernel can
> corrupt user-space memory of the after-exec loaded binary when it attempts
> to update the "enabled" state ?
> 

I believe it could if the memory descriptor remains. Callers should
mark it close-on-exec to prevent this. None of this was a problem
with the old ABI :)
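
For reference, marking an already-open descriptor close-on-exec is a standard
fcntl() call. The sketch below uses hypothetical helper names and works on any
descriptor. Passing O_CLOEXEC to open() achieves the same thing atomically at
open time.

```c
#include <fcntl.h>
#include <unistd.h>

/* Set FD_CLOEXEC on fd while preserving other descriptor flags. */
static int set_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Returns 1 if fd is close-on-exec, 0 if not, -1 on error. */
static int is_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return (flags & FD_CLOEXEC) ? 1 : 0;
}
```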

For clarity, since I cannot tell:
Are you advocating for a different approach here or just calling out
that we need to add guidance to user space programs to do the right thing?

> If I get this right, I suspect we might want to move the lifetime of the
> user event registration to the memory space (mm_struct).
> 

If that file or mmap I proposed stays open, the enablers will stay open.
The enabler keeps the mm_struct alive, not the other way around.

I'm not sure I follow under what condition you'd have an enabler /
user_event around with a mm_struct that has gone away. The enabler keeps
the mm_struct alive until the enabler goes away to prevent any funny
business. That appears to fit more in line with what others have done in
the kernel than trying to tie the enabler to the mm_struct lifetime.

If a memory descriptor goes away, then its FDs should close (is there
ever a case where this is not true?). If the FDs go away, the enablers close
down. If the enablers close down, the user_event refs drop to 0. The
user_event can then be deleted via an explicit IOCTL, which will only work
if at that point in time the ref count is still 0.
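
As a toy model of that chain (not the kernel code, all names hypothetical, and
-EBUSY is only an assumed error value for the sketch), the rule amounts to a
plain reference count that gates the explicit delete:

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for a user_event: only the ref count and a deleted flag. */
struct toy_user_event {
	int refcnt;
	int deleted;
};

/* An enabler, open file, or perf/ftrace user takes a reference. */
static void toy_get(struct toy_user_event *e)
{
	e->refcnt++;
}

/* Dropped when the FD closes and its enablers shut down. */
static void toy_put(struct toy_user_event *e)
{
	assert(e->refcnt > 0);
	e->refcnt--;
}

/* Explicit delete: only succeeds while no references remain. */
static int toy_delete(struct toy_user_event *e)
{
	if (e->refcnt != 0)
		return -EBUSY;
	e->deleted = 1;
	return 0;
}
```

The event never disappears on its own when the count reaches 0. It only
becomes eligible for the explicit delete.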

> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com

Thanks,
-Beau