mbox series

[RFC,0/4] Add support to directly attach BPF program to ftrace

Message ID 20190710141548.132193-1-joel@joelfernandes.org (mailing list archive)
Headers show
Series Add support to directly attach BPF program to ftrace | expand

Message

Joel Fernandes July 10, 2019, 2:15 p.m. UTC
Hi,
These patches make it possible to attach BPF programs directly to tracepoints
using ftrace (/sys/kernel/debug/tracing) without needing the process doing the
attach to be alive. This has the following benefits:

1. Simplified Security: In Android, we have finer-grained security controls to
specific ftrace trace events using SELinux labels. We control precisely who is
allowed to enable an ftrace event already. By adding a node to ftrace for
attaching BPF programs, we can use the same mechanism to further control who is
allowed to attach to a trace event.

2. Process lifetime: In Android we are adding usecases where a tracing program
needs to be attached all the time to a tracepoint, for the full life time of
the system. Such as to gather statistics where there no need for a detach for
the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this
means keeping a process alive all the time.  However, in Android our BPF loader
currently (for hardeneded security) involves just starting a process at boot
time, doing the BPF program loading, and then pinning them to /sys/fs/bpf.  We
don't keep this process alive all the time. It is more suitable to do a
one-shot attach of the program using ftrace and not need to have a process
alive all the time anymore for this. Such process also needs elevated
privileges since tracepoint program loading currently requires CAP_SYS_ADMIN
anyway so by design Android's bpfloader runs once at init and exits.

This series add a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf
The following commands can be written into it:
attach:<fd>     Attaches BPF prog fd to tracepoint
detach:<fd>     Detaches BPF prog fd to tracepoint

Reading the bpf file will show all the attached programs to the tracepoint.

Joel Fernandes (Google) (4):
Move bpf_raw_tracepoint functionality into bpf_trace.c
trace/bpf: Add support for attach/detach of ftrace events to BPF
lib/bpf: Add support for ftrace event attach and detach
selftests/bpf: Add test for ftrace-based BPF attach/detach

include/linux/bpf_trace.h                     |  16 ++
include/linux/trace_events.h                  |   1 +
kernel/bpf/syscall.c                          |  69 +-----
kernel/trace/bpf_trace.c                      | 225 ++++++++++++++++++
kernel/trace/trace.h                          |   1 +
kernel/trace/trace_events.c                   |   8 +
tools/lib/bpf/bpf.c                           |  53 +++++
tools/lib/bpf/bpf.h                           |   4 +
tools/lib/bpf/libbpf.map                      |   2 +
.../raw_tp_writable_test_ftrace_run.c         |  89 +++++++
10 files changed, 410 insertions(+), 58 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_ftrace_run.c

--
2.22.0.410.gd8fdbe21b5-goog

Comments

Alexei Starovoitov July 16, 2019, 8:54 p.m. UTC | #1
On Wed, Jul 10, 2019 at 10:15:44AM -0400, Joel Fernandes (Google) wrote:
> Hi,

why are you cc-ing the whole world for this patch set?
I'll reply to all as well, but I suspect a bunch of folks consider it spam.
Please read Documentation/bpf/bpf_devel_QA.rst

Also, I think, netdev@vger rejects emails with 80+ characters in cc as spam,
so I'm not sure this set reached public mailing lists.

> These patches make it possible to attach BPF programs directly to tracepoints
> using ftrace (/sys/kernel/debug/tracing) without needing the process doing the
> attach to be alive. This has the following benefits:
> 
> 1. Simplified Security: In Android, we have finer-grained security controls to
> specific ftrace trace events using SELinux labels. We control precisely who is
> allowed to enable an ftrace event already. By adding a node to ftrace for
> attaching BPF programs, we can use the same mechanism to further control who is
> allowed to attach to a trace event.
> 
> 2. Process lifetime: In Android we are adding usecases where a tracing program
> needs to be attached all the time to a tracepoint, for the full life time of
> the system. Such as to gather statistics where there no need for a detach for
> the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this
> means keeping a process alive all the time.  However, in Android our BPF loader
> currently (for hardeneded security) involves just starting a process at boot
> time, doing the BPF program loading, and then pinning them to /sys/fs/bpf.  We
> don't keep this process alive all the time. It is more suitable to do a
> one-shot attach of the program using ftrace and not need to have a process
> alive all the time anymore for this. Such process also needs elevated
> privileges since tracepoint program loading currently requires CAP_SYS_ADMIN
> anyway so by design Android's bpfloader runs once at init and exits.
> 
> This series add a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf
> The following commands can be written into it:
> attach:<fd>     Attaches BPF prog fd to tracepoint
> detach:<fd>     Detaches BPF prog fd to tracepoint

Looks like, to detach a program the user needs to read a text file,
parse bpf prog id from text into binary. Then call fd_from_id bpf syscall,
get a binary FD, convert it back to text and write as a text back into this file.
I think this is just a single example why text based apis are not accepted
in bpf anymore.

Through the patch set you call it ftrace. As far as I can see, this set
has zero overlap with ftrace. There is no ftrace-bpf connection here at all
that we discussed in the past Steven. It's all quite confusing.

I suggest android to solve sticky raw_tracepoint problem with user space deamon.
The reasons, you point out why user daemon cannot be used, sound weak to me.
Another acceptable solution would be to introduce pinning of raw_tp objects.
bpf progs and maps can be pinned in bpffs already. Pinning raw_tp would
be natural extension.
Joel Fernandes July 16, 2019, 9:30 p.m. UTC | #2
On Tue, Jul 16, 2019 at 01:54:57PM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 10, 2019 at 10:15:44AM -0400, Joel Fernandes (Google) wrote:
> > Hi,
> 
> why are you cc-ing the whole world for this patch set?

Well, the whole world happens to be interested in BPF on Android.

> I'll reply to all as well, but I suspect a bunch of folks consider it spam.
> Please read Documentation/bpf/bpf_devel_QA.rst

Ok, I'll read it.

> Also, I think, netdev@vger rejects emails with 80+ characters in cc as spam,
> so I'm not sure this set reached public mailing lists.

Certainly the CC list here is not added to folks who consider it spam. All
the folks added have been interested in BPF on Android at various points of
time. Is this CC list really that large? It has around 24 email addresses or
so. I can trim it a bit if needed. Also, you sound like as if people are
screaming at me to stop emailing them, certainly that's not the case and no
one has told me it is spam.

And, it did reach the public archive btw:
https://lore.kernel.org/netdev/20190716205455.iimn3pqpvsc3k4ry@ast-mbp.dhcp.thefacebook.com/T/#m1460ba463b78312e38b68b8c118f673d2ead9446

> > These patches make it possible to attach BPF programs directly to tracepoints
> > using ftrace (/sys/kernel/debug/tracing) without needing the process doing the
> > attach to be alive. This has the following benefits:
> > 
> > 1. Simplified Security: In Android, we have finer-grained security controls to
> > specific ftrace trace events using SELinux labels. We control precisely who is
> > allowed to enable an ftrace event already. By adding a node to ftrace for
> > attaching BPF programs, we can use the same mechanism to further control who is
> > allowed to attach to a trace event.
> > 
> > 2. Process lifetime: In Android we are adding usecases where a tracing program
> > needs to be attached all the time to a tracepoint, for the full life time of
> > the system. Such as to gather statistics where there no need for a detach for
> > the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this
> > means keeping a process alive all the time.  However, in Android our BPF loader
> > currently (for hardeneded security) involves just starting a process at boot
> > time, doing the BPF program loading, and then pinning them to /sys/fs/bpf.  We
> > don't keep this process alive all the time. It is more suitable to do a
> > one-shot attach of the program using ftrace and not need to have a process
> > alive all the time anymore for this. Such process also needs elevated
> > privileges since tracepoint program loading currently requires CAP_SYS_ADMIN
> > anyway so by design Android's bpfloader runs once at init and exits.
> > 
> > This series add a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf
> > The following commands can be written into it:
> > attach:<fd>     Attaches BPF prog fd to tracepoint
> > detach:<fd>     Detaches BPF prog fd to tracepoint
> 
> Looks like, to detach a program the user needs to read a text file,
> parse bpf prog id from text into binary. Then call fd_from_id bpf syscall,
> get a binary FD, convert it back to text and write as a text back into this file.
> I think this is just a single example why text based apis are not accepted
> in bpf anymore.

This can also be considered a tracefs API.

And we can certainly change the detach to accept program ids as well if
that's easier. 'detach:prog:<prog_id>' and 'detach:fd:<fd>'.

By the way, I can also list the set of cumbersome steps needed to attach a
BPF program using perf and I bet it will be longer ;-)

> Through the patch set you call it ftrace. As far as I can see, this set
> has zero overlap with ftrace. There is no ftrace-bpf connection here at all
> that we discussed in the past Steven. It's all quite confusing.

It depends on what you mean by ftrace, may be I can call it 'trace events' or
something if it is less ambiguious. All of this has been collectively called
ftrace before.

I am not sure if you you are making sense actually, trace_events mechanism is
a part of ftrace. See the documentation: Documentation/trace/ftrace.rst. Even
the documentation file name has the word ftrace in it.

I have also spoken to Steven before about this, I don't think he ever told me
there is no connection so again I am a bit lost at your comments.

> I suggest android to solve sticky raw_tracepoint problem with user space deamon.
> The reasons, you point out why user daemon cannot be used, sound weak to me.

I don't think it is weak. It seems overkill to have a daemon for a trace
event that is say supposed to be attached to all the time for the lifetime of
the system. Why should there be a daemon consuming resources if it is active
all the time?

In Android, we are very careful about spawning useless processes and leaving
them alive for the lifetime of the system - for no good reason. Our security
teams also don't like this, and they can comment more.

> Another acceptable solution would be to introduce pinning of raw_tp objects.
> bpf progs and maps can be pinned in bpffs already. Pinning raw_tp would
> be natural extension.

I don't think the pinning solves the security problem, it just solves the
process lifetime problem. Currently, attaching trace events through perf
requires CAP_SYS_ADMIN. However, with ftrace events, we already control
security of events by labeling the nodes in tracefs and granting access to
the labeled context through the selinux policies. Having a 'bpf' node in
tracefs for events, and granting access to the labels is a natural extension.

I also thought about the pinning idea before, but we also want to add support
for not just raw tracepoints, but also regular tracepoints (events if you
will). I am hesitant to add a new BPF API just for creating regular
tracepoints and then pinning those as well.

I don't see why a new bpf node for a trace event is a bad idea, really.
tracefs is how we deal with trace events on Android. We do it in production
systems. This is a natural extension to that and fits with the security model
well.

thanks,

 - Joel
Alexei Starovoitov July 16, 2019, 10:26 p.m. UTC | #3
On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote:
> 
> I also thought about the pinning idea before, but we also want to add support
> for not just raw tracepoints, but also regular tracepoints (events if you
> will). I am hesitant to add a new BPF API just for creating regular
> tracepoints and then pinning those as well.

and they should be done through the pinning as well.

> I don't see why a new bpf node for a trace event is a bad idea, really.

See the patches for kprobe/uprobe FD-based api and the reasons behind it.
tldr: text is racy, doesn't scale, poor security, etc.

> tracefs is how we deal with trace events on Android. 

android has made mistakes in the past as well.

> This is a natural extension to that and fits with the security model
> well.

I think it's the opposite.
I'm absolutely against text based apis.
Steven Rostedt July 16, 2019, 10:31 p.m. UTC | #4
On Tue, 16 Jul 2019 17:30:50 -0400
Joel Fernandes <joel@joelfernandes.org> wrote:

> I don't see why a new bpf node for a trace event is a bad idea, really.
> tracefs is how we deal with trace events on Android. We do it in production
> systems. This is a natural extension to that and fits with the security model
> well.

What I would like to see is a way to have BPF inject data into the
ftrace ring buffer directly. There's a bpf_trace_printk() that I find a
bit of a hack (especially since it hooks into trace_printk() which is
only for debugging purposes). Have a dedicated bpf ftrace ring
buffer event that can be triggered is what I am looking for. Then comes
the issue of what ring buffer to place it in, as ftrace can have
multiple ring buffer instances. But these instances are defined by the
tracefs instances directory. Having a way to associate a bpf program to
a specific event in a specific tracefs directory could allow for ways to
trigger writing into the correct ftrace buffer.

But looking over the patches, I see what Alexei means that there's no
overlap with ftrace and these patches except for the tracefs directory
itself (which is part of the ftrace infrastructure). And the trace
events are technically part of the ftrace infrastructure too. I see the
tracefs interface being used, but I don't see how the bpf programs
being added affect the ftrace ring buffer or other parts of ftrace. And
I'm guessing that's what is confusing Alexei.

-- Steve
Joel Fernandes July 16, 2019, 10:41 p.m. UTC | #5
On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote:
> On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote:
> > 
> > I also thought about the pinning idea before, but we also want to add support
> > for not just raw tracepoints, but also regular tracepoints (events if you
> > will). I am hesitant to add a new BPF API just for creating regular
> > tracepoints and then pinning those as well.
> 
> and they should be done through the pinning as well.

Hmm ok, I will give it some more thought.

> > I don't see why a new bpf node for a trace event is a bad idea, really.
> 
> See the patches for kprobe/uprobe FD-based api and the reasons behind it.
> tldr: text is racy, doesn't scale, poor security, etc.

Is it possible to use perf without CAP_SYS_ADMIN and control security at the
per-event level? We are selective about who can access which event, using
selinux. That's how our ftrace-based tracers work. Its fine grained per-event
control. That's where I was going with the tracefs approach since we get that
granularity using the file system.

Thanks.
Steven Rostedt July 16, 2019, 10:43 p.m. UTC | #6
On Tue, 16 Jul 2019 15:26:52 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> I'm absolutely against text based apis.

I guess you don't use /proc ;-)

-- Steve
Joel Fernandes July 16, 2019, 10:46 p.m. UTC | #7
On Tue, Jul 16, 2019 at 06:31:17PM -0400, Steven Rostedt wrote:
> On Tue, 16 Jul 2019 17:30:50 -0400
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> > I don't see why a new bpf node for a trace event is a bad idea, really.
> > tracefs is how we deal with trace events on Android. We do it in production
> > systems. This is a natural extension to that and fits with the security model
> > well.
> 
> What I would like to see is a way to have BPF inject data into the
> ftrace ring buffer directly. There's a bpf_trace_printk() that I find a
> bit of a hack (especially since it hooks into trace_printk() which is
> only for debugging purposes). Have a dedicated bpf ftrace ring
> buffer event that can be triggered is what I am looking for. Then comes
> the issue of what ring buffer to place it in, as ftrace can have
> multiple ring buffer instances. But these instances are defined by the
> tracefs instances directory. Having a way to associate a bpf program to
> a specific event in a specific tracefs directory could allow for ways to
> trigger writing into the correct ftrace buffer.

But his problem is with doing the association of a BPF program with tracefs
itself. How would you attach a BPF program with tracefs without doing a text
based approach? His problem is with the text based approach per his last
email.

> But looking over the patches, I see what Alexei means that there's no
> overlap with ftrace and these patches except for the tracefs directory
> itself (which is part of the ftrace infrastructure). And the trace
> events are technically part of the ftrace infrastructure too. I see the
> tracefs interface being used, but I don't see how the bpf programs
> being added affect the ftrace ring buffer or other parts of ftrace. And
> I'm guessing that's what is confusing Alexei.

In a follow-up patch which I am still writing, I am using the trace ring
buffer as temporary storage since I am formatting the trace event into it.
This patch you are replying to is just for raw tracepoint and yes, I agree
this one does not use the ring buffer, but a future addition to it does. So
I don't think the association of this patch series with ftrace is going to be
an issue IMO.

thanks,

 - Joel
Joel Fernandes July 16, 2019, 11:55 p.m. UTC | #8
On Tue, Jul 16, 2019 at 06:41:50PM -0400, Joel Fernandes wrote:
> On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote:
> > On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote:
> > > 
> > > I also thought about the pinning idea before, but we also want to add support
> > > for not just raw tracepoints, but also regular tracepoints (events if you
> > > will). I am hesitant to add a new BPF API just for creating regular
> > > tracepoints and then pinning those as well.
> > 
> > and they should be done through the pinning as well.
> 
> Hmm ok, I will give it some more thought.

I think I can make the new BPF API + pinning approach work, I will try to
work on something like this and post it soon.

Also, I had a question below if you don't mind taking a look:

thanks Alexei!

> > > I don't see why a new bpf node for a trace event is a bad idea, really.
> > 
> > See the patches for kprobe/uprobe FD-based api and the reasons behind it.
> > tldr: text is racy, doesn't scale, poor security, etc.
> 
> Is it possible to use perf without CAP_SYS_ADMIN and control security at the
> per-event level? We are selective about who can access which event, using
> selinux. That's how our ftrace-based tracers work. Its fine grained per-event
> control. That's where I was going with the tracefs approach since we get that
> granularity using the file system.
> 
> Thanks.
>
Alexei Starovoitov July 17, 2019, 1:24 a.m. UTC | #9
On Tue, Jul 16, 2019 at 07:55:00PM -0400, Joel Fernandes wrote:
> On Tue, Jul 16, 2019 at 06:41:50PM -0400, Joel Fernandes wrote:
> > On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote:
> > > On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote:
> > > > 
> > > > I also thought about the pinning idea before, but we also want to add support
> > > > for not just raw tracepoints, but also regular tracepoints (events if you
> > > > will). I am hesitant to add a new BPF API just for creating regular
> > > > tracepoints and then pinning those as well.
> > > 
> > > and they should be done through the pinning as well.
> > 
> > Hmm ok, I will give it some more thought.
> 
> I think I can make the new BPF API + pinning approach work, I will try to
> work on something like this and post it soon.
> 
> Also, I had a question below if you don't mind taking a look:
> 
> thanks Alexei!
> 
> > > > I don't see why a new bpf node for a trace event is a bad idea, really.
> > > 
> > > See the patches for kprobe/uprobe FD-based api and the reasons behind it.
> > > tldr: text is racy, doesn't scale, poor security, etc.
> > 
> > Is it possible to use perf without CAP_SYS_ADMIN and control security at the
> > per-event level? We are selective about who can access which event, using
> > selinux. That's how our ftrace-based tracers work. Its fine grained per-event
> > control. That's where I was going with the tracefs approach since we get that
> > granularity using the file system.

android's choice of selinux is not a factor in deciding kernel apis.
It's completely separate discusion wether disallowing particular tracepoints
for given user make sense at all.
Just because you can hack it in via selinux blocking particular
/sys/debug/tracing/ directory and convince yourself that it's somehow
makes android more secure. It doesn't mean that all new api should fit
into this model.
I think allowing one tracepoint and disallowing another is pointless
from security point of view. Tracing bpf program can do bpf_probe_read
of anything.
Alexei Starovoitov July 17, 2019, 1:30 a.m. UTC | #10
On Tue, Jul 16, 2019 at 06:31:17PM -0400, Steven Rostedt wrote:
> On Tue, 16 Jul 2019 17:30:50 -0400
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> > I don't see why a new bpf node for a trace event is a bad idea, really.
> > tracefs is how we deal with trace events on Android. We do it in production
> > systems. This is a natural extension to that and fits with the security model
> > well.
> 
> What I would like to see is a way to have BPF inject data into the
> ftrace ring buffer directly. There's a bpf_trace_printk() that I find a
> bit of a hack (especially since it hooks into trace_printk() which is
> only for debugging purposes). Have a dedicated bpf ftrace ring
> buffer event that can be triggered is what I am looking for. Then comes
> the issue of what ring buffer to place it in, as ftrace can have
> multiple ring buffer instances. But these instances are defined by the
> tracefs instances directory. Having a way to associate a bpf program to
> a specific event in a specific tracefs directory could allow for ways to
> trigger writing into the correct ftrace buffer.

bpf can write anything (including full skb) into perf ring buffer.
I don't think there is a use case yet to write into ftrace ring buffer.
But I can be convinced otherwise :)

> But looking over the patches, I see what Alexei means that there's no
> overlap with ftrace and these patches except for the tracefs directory
> itself (which is part of the ftrace infrastructure). And the trace
> events are technically part of the ftrace infrastructure too. I see the
> tracefs interface being used, but I don't see how the bpf programs
> being added affect the ftrace ring buffer or other parts of ftrace. And
> I'm guessing that's what is confusing Alexei.

yep.
What I really like to see some day is proper integration of real ftrace
and bpf. So that bpf progs can be invoked from some of the ftrace machinery.
And the other way around.
Like I'd love to be able to start ftracing all functions from bpf
and stop it from bpf.
The use case is kernel debugging. I'd like to examine a bunch of condition
and start ftracing the execution. Then later decide wether this collection
of ip addresses is interesting to analyze and post process it quickly
while still inside bpf program.
Joel Fernandes July 17, 2019, 1:01 p.m. UTC | #11
On Tue, Jul 16, 2019 at 06:24:07PM -0700, Alexei Starovoitov wrote:
[snip]
> > > > > I don't see why a new bpf node for a trace event is a bad idea, really.
> > > > 
> > > > See the patches for kprobe/uprobe FD-based api and the reasons behind it.
> > > > tldr: text is racy, doesn't scale, poor security, etc.
> > > 
> > > Is it possible to use perf without CAP_SYS_ADMIN and control security at the
> > > per-event level? We are selective about who can access which event, using
> > > selinux. That's how our ftrace-based tracers work. Its fine grained per-event
> > > control. That's where I was going with the tracefs approach since we get that
> > > granularity using the file system.
> 
> android's choice of selinux is not a factor in deciding kernel apis.
> It's completely separate discusion wether disallowing particular tracepoints
> for given user make sense at all.
> Just because you can hack it in via selinux blocking particular
> /sys/debug/tracing/ directory and convince yourself that it's somehow
> makes android more secure. It doesn't mean that all new api should fit
> into this model.

Its not like a hack, it is just control of which tracefs node can be
accessed and which cannot be since the tracing can run on production systems
out in the field and there are several concerns to address like security,
privacy etc. It is not just for debugging usecases. We do collect traces out
in the field where these issues are real and cannot be ignored.

SELinux model is deny everything, and then selectively grant access to what
is needed. The VFS and security LSM hooks provide this control quite well. I am
not sure if such control is possible through perf hence I asked the question.

> I think allowing one tracepoint and disallowing another is pointless
> from security point of view. Tracing bpf program can do bpf_probe_read
> of anything.

I think the assumption here is the user controls the program instructions at
runtime, but that's not the case. The BPF program we are loading is not
dynamically generated, it is built at build time and it is loaded from a
secure verified partition, so even though it can do bpf_probe_read, it is
still not something that the user can change. And, we are planning to make it
even more secure by making it kernel verify the program at load time as well
(you were on some discussions about that a few months ago).

thanks,

 - Joel