mbox series

[RESEND,RFC,bpf-next,v1,0/8] Pinning bpf objects outside bpffs

Message ID 20220112193152.3058718-1-haoluo@google.com (mailing list archive)
Headers show
Series Pinning bpf objects outside bpffs | expand

Message

Hao Luo Jan. 12, 2022, 7:31 p.m. UTC
Bpffs is a pseudo file system that persists bpf objects. Previously
bpf objects can only be pinned in bpffs, this patchset extends pinning
to allow bpf objects to be pinned (or exposed) to other file systems.

In particular, this patchset allows pinning bpf objects in kernfs. This
creates a new file entry in the kernfs file system and the created file
is able to reference the bpf object. By doing so, bpf can be used to
customize the file's operations, such as seq_show.

As a concrete usecase of this feature, this patchset introduces a
simple new program type called 'bpf_view', which can be used to format
a seq file by a kernel object's state. By pinning a bpf_view program
into a cgroup directory, userspace is able to read the cgroup's state
from file in a format defined by the bpf program.

Different from bpffs, kernfs doesn't have a callback when a kernfs node
is freed, which is problem if we allow the kernfs node to hold an extra
reference of the bpf object, because there is no chance to dec the
object's refcnt. Therefore the kernfs node created by pinning doesn't
hold reference of the bpf object. The lifetime of the kernfs node
depends on the lifetime of the bpf object. Rather than "pinning in
kernfs", it is "exposing to kernfs". We require the bpf object to be
pinned in bpffs first before it can be pinned in kernfs. When the
object is unpinned from bpffs, their kernfs nodes will be removed
automatically. This somehow treats a pinned bpf object as a persistent
"device".

We rely on fsnotify to monitor the inode events in bpffs. A new function
bpf_watch_inode() is introduced. It allows registering a callback
function at inode destruction. For the kernfs case, a callback that
removes kernfs node is registered at the destruction of bpffs inodes.
For other file systems such as sockfs, bpf_watch_inode() can monitor the
destruction of sockfs inodes and the created file entry can hold the bpf
object's reference. In this case, it is truly "pinning".

File operations other than seq_show can also be implemented using bpf.
For example, bpf may be of help for .poll and .mmap in kernfs.

Patch organization:
 - patch 1/8 and 2/8 are preparations. 1/8 implements bpf_watch_inode();
   2/8 records bpffs inode in bpf object.
 - patch 3/8 and 4/8 implement generic logic for creating bpf backed
   kernfs file.
 - patch 5/8 and 6/8 add a new program type for formatting output.
 - patch 7/8 implements cgroup seq_show operation using bpf.
 - patch 8/8 adds selftest.

Hao Luo (8):
  bpf: Support pinning in non-bpf file system.
  bpf: Record back pointer to the inode in bpffs
  bpf: Expose bpf object in kernfs
  bpf: Support removing kernfs entries
  bpf: Introduce a new program type bpf_view.
  libbpf: Support of bpf_view prog type.
  bpf: Add seq_show operation for bpf in cgroupfs
  selftests/bpf: Test exposing bpf objects in kernfs

 include/linux/bpf.h                           |   9 +-
 include/uapi/linux/bpf.h                      |   2 +
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/bpf_view.c                         | 190 ++++++++++++++
 kernel/bpf/bpf_view.h                         |  25 ++
 kernel/bpf/inode.c                            | 219 ++++++++++++++--
 kernel/bpf/inode.h                            |  54 ++++
 kernel/bpf/kernfs_node.c                      | 165 ++++++++++++
 kernel/bpf/syscall.c                          |   3 +
 kernel/bpf/verifier.c                         |   6 +
 kernel/trace/bpf_trace.c                      |  12 +-
 tools/include/uapi/linux/bpf.h                |   2 +
 tools/lib/bpf/libbpf.c                        |  21 ++
 .../selftests/bpf/prog_tests/pinning_kernfs.c | 245 ++++++++++++++++++
 .../selftests/bpf/progs/pinning_kernfs.c      |  72 +++++
 15 files changed, 995 insertions(+), 32 deletions(-)
 create mode 100644 kernel/bpf/bpf_view.c
 create mode 100644 kernel/bpf/bpf_view.h
 create mode 100644 kernel/bpf/inode.h
 create mode 100644 kernel/bpf/kernfs_node.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/pinning_kernfs.c
 create mode 100644 tools/testing/selftests/bpf/progs/pinning_kernfs.c

Comments

Tejun Heo Jan. 12, 2022, 10:37 p.m. UTC | #1
Hello, Hao.

On Wed, Jan 12, 2022 at 11:31:44AM -0800, Hao Luo wrote:
> As a concrete usecase of this feature, this patchset introduces a
> simple new program type called 'bpf_view', which can be used to format
> a seq file by a kernel object's state. By pinning a bpf_view program
> into a cgroup directory, userspace is able to read the cgroup's state
> from file in a format defined by the bpf program.

Both kernfs users - sysfs and cgroups - are hard APIs just as procfs, so
allowing bpf programs to add arbitrarily formatted files anywhere likely
isn't a good idea. Even ignoring the hard-defined interface problem,
allowing arbitrary files can cause surprising failures through namespace
collisions (which may be worked around by restricting where these files
reside or how they're named).

While the attraction is understandable, I think this is a misguided
direction. Text file interfaces are okay, or sometimes even good, for
certain things - communicating well established small piece of information.
They're easy to access and as long as the format stays really stable, the
million parsers that they end up spawning are mostly manageable although you
inevitably end up with "I was reading 3rd field of 4th line and you added a
new line above!".

The above also illustrates the problems with using these text file
interfaces. They're good for really static stuff or something really
provisional like for debugging where there's only one or very few consumers.
Outside of those extremes, they become pretty terrible. They are very
inefficient when the data volume isn't trivial. There's no good way to
synchronize accesses to multiple files. There are million ways to parse
them, many of them ever so subtly wrong. There's no good way to version the
interface (not that you can't). And if you throw these flexible files in the
midst of other hard-API files, it'll severely exacerbate confusion.

Also, for something which isn't stable, I think it's better to have some of
the access logic on the reader side. For example, let's say you've been
using data from a cgroup file on the system. One day, the system boots and
the file isn't there. How would you debug that? If it were a, say, py-bcc
script which fetched data through bpf, it wouldn't be difficult to track.
This isn't just happenstance - if you're reaching into a black box to get
data, you better keep that mechanism close to you as it's a fragile
temporary thing prone to breaking.

Yet another argument against it is that the kernel is a really bad place to
process and format data. We can't do real percentiles, or any kind of
non-trivial numeric analysis, or even handle and format real numbers. Given
that bpf already has an a lot more efficient way to transfer data between
kernel and user, it doesn't make sense to push data formatting into the
kernel. Export data to userspace, process and format there.

If there are reasons why this isn't very convenient using bpf. I think the
right thing to do is improving those. One issue that Song raised was that
there's no easy to allow non-root users to run specific bpf programs and
even if we do that with SUID binaries, each execution would be pretty
expensive involving full verification run and so on. But those problems are
solvable - maybe known BPF programs can be cached and made available to
other users - and I think concentrating on such direction would be way more
fruitful for wider purposes than trying to make BPF show text files in
existing fixed interfaces.

Thanks.
Hao Luo Jan. 13, 2022, 3:42 a.m. UTC | #2
On Wed, Jan 12, 2022 at 2:37 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Hao.
>

Thanks Tejun for your insights.

> On Wed, Jan 12, 2022 at 11:31:44AM -0800, Hao Luo wrote:
> > As a concrete usecase of this feature, this patchset introduces a
> > simple new program type called 'bpf_view', which can be used to format
> > a seq file by a kernel object's state. By pinning a bpf_view program
> > into a cgroup directory, userspace is able to read the cgroup's state
> > from file in a format defined by the bpf program.
>
> Both kernfs users - sysfs and cgroups - are hard APIs just as procfs, so
> allowing bpf programs to add arbitrarily formatted files anywhere likely
> isn't a good idea. Even ignoring the hard-defined interface problem,
> allowing arbitrary files can cause surprising failures through namespace
> collisions (which may be worked around by restricting where these files
> reside or how they're named).
>
> While the attraction is understandable, I think this is a misguided
> direction. Text file interfaces are okay, or sometimes even good, for
> certain things - communicating well established small piece of information.
> They're easy to access and as long as the format stays really stable, the
> million parsers that they end up spawning are mostly manageable although you
> inevitably end up with "I was reading 3rd field of 4th line and you added a
> new line above!".
>
> The above also illustrates the problems with using these text file
> interfaces. They're good for really static stuff or something really
> provisional like for debugging where there's only one or very few consumers.
> Outside of those extremes, they become pretty terrible. They are very
> inefficient when the data volume isn't trivial. There's no good way to
> synchronize accesses to multiple files. There are million ways to parse
> them, many of them ever so subtly wrong. There's no good way to version the
> interface (not that you can't). And if you throw these flexible files in the
> midst of other hard-API files, it'll severely exacerbate confusion.
>

I understand the importance of a set of hard APIs and appreciate the
effort maintainers put on maintaining them. I acknowledge the problems
of text file interfaces mentioned above. But there are situations
where the text file interface also provides great value, in a sense, I
think, outweighs its limitations. Bpf iter has many great
applications. Bpf iter could be made more efficient, providing greater
value.

I agree that mixing flexible files with hard-API files is a problem.
And my understanding is that, it's the key concern here. It would be
great if there is a way to separate the bpf files from the stable
APIs. I'm now thinking along this direction.

> Also, for something which isn't stable, I think it's better to have some of
> the access logic on the reader side. For example, let's say you've been
> using data from a cgroup file on the system. One day, the system boots and
> the file isn't there. How would you debug that? If it were a, say, py-bcc
> script which fetched data through bpf, it wouldn't be difficult to track.
> This isn't just happenstance - if you're reaching into a black box to get
> data, you better keep that mechanism close to you as it's a fragile
> temporary thing prone to breaking.
>

From the view of userspace, allowing bpf to define kernel interface is
giving the userspace full control. With everything controlled in
userspace, debugging is easier rather than harder in my understanding.
Access logic on the reader side is always needed of course. I've seen
bugs where even stable files in cgroupfs are seemingly missing, which
is harder to debug than bpf loading failure.

> Yet another argument against it is that the kernel is a really bad place to
> process and format data. We can't do real percentiles, or any kind of
> non-trivial numeric analysis, or even handle and format real numbers. Given
> that bpf already has an a lot more efficient way to transfer data between
> kernel and user, it doesn't make sense to push data formatting into the
> kernel. Export data to userspace, process and format there.
>

There is a plus side of processing and formatting data in the kernel.
Not every type of handling needs to process real percentiles or format
real numbers. A big saving in cpu cycles can be achieved by the power
of customizing data encoding inside the kernel.

Currently many data are exported in text, userspace parses them and
encodes them in a format suitable for network transmission (for
example, Protobuf). If the kernel can encode the data directly in its
final format, that would save the cpu cycles spent on encoding to and
decoding from text. Multiplying this saving by the frequency of data
collections and scale of the data center, the number can be
significant. Bpf can transfer data in binary, but there is currently
no good way to control data encoding and organize data in cgroups, as
far as I know.

> If there are reasons why this isn't very convenient using bpf. I think the
> right thing to do is improving those. One issue that Song raised was that
> there's no easy to allow non-root users to run specific bpf programs and
> even if we do that with SUID binaries, each execution would be pretty
> expensive involving full verification run and so on. But those problems are
> solvable - maybe known BPF programs can be cached and made available to
> other users - and I think concentrating on such direction would be way more
> fruitful for wider purposes than trying to make BPF show text files in
> existing fixed interfaces.
>
> Thanks.
>
> --
> tejun

Thanks,

Hao