[00/26] FUSE BPF: A Stacked Filesystem Extension for FUSE

Message ID	20220926231822.994383-1-drosen@google.com (mailing list archive)
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> Date: Mon, 26 Sep 2022 16:17:56 -0700 Mime-Version: 1.0 Message-ID: <20220926231822.994383-1-drosen@google.com> Subject: [PATCH 00/26] FUSE BPF: A Stacked Filesystem Extension for FUSE From: Daniel Rosenberg <drosen@google.com> To: Miklos Szeredi <miklos@szeredi.hu>, Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, John Fastabend <john.fastabend@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org>, Martin KaFai Lau <martin.lau@linux.dev>, Song Liu <song@kernel.org>, Yonghong Song <yhs@fb.com>, KP Singh <kpsingh@kernel.org>, Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>, Jiri Olsa <jolsa@kernel.org>, Daniel Rosenberg <drosen@google.com>, Paul Lawrence <paullawrence@google.com>, Alessio Balsini <balsini@google.com>, David Anderson <dvander@google.com>, Sandeep Patil <sspatil@google.com>, linux-fsdevel@vger.kernel.org, bpf@vger.kernel.org, kernel-team@android.com Content-Type: text/plain; charset="UTF-8" Precedence: bulk
Series	FUSE BPF: A Stacked Filesystem Extension for FUSE \| expand [00/26] FUSE BPF: A Stacked Filesystem Extension for FUSE [01/26] bpf: verifier: Allow for multiple packets [02/26] bpf: verifier: Allow single packet invalidation [03/26] fuse-bpf: Update uapi for fuse-bpf [04/26] fuse-bpf: Add BPF supporting functions [05/26] fs: Generic function to convert iocb to rw flags [06/26] bpf: Export bpf_prog_fops [07/26] fuse-bpf: Prepare for fuse-bpf patch [08/26] fuse: Add fuse-bpf, a stacked fs extension for FUSE [09/26] fuse-bpf: Don't support export_operations [10/26] fuse-bpf: Partially add mapping support [11/26] fuse-bpf: Add lseek support [12/26] fuse-bpf: Add support for fallocate [13/26] fuse-bpf: Support file/dir open/close [14/26] fuse-bpf: Support mknod/unlink/mkdir/rmdir [15/26] fuse-bpf: Add support for read/write iter [16/26] fuse-bpf: support FUSE_READDIR [17/26] fuse-bpf: Add support for sync operations [18/26] fuse-bpf: Add Rename support [19/26] fuse-bpf: Add attr support [20/26] fuse-bpf: Add support for FUSE_COPY_FILE_RANGE [21/26] fuse-bpf: Add xattr support [22/26] fuse-bpf: Add symlink/link support [23/26] fuse-bpf: allow mounting with no userspace daemon [24/26] fuse-bpf: Call bpf for pre/post filters [25/26] fuse-bpf: Add userspace pre/post filters [26/26] fuse-bpf: Add selftests

Daniel Rosenberg Sept. 26, 2022, 11:17 p.m. UTC

These patches extend FUSE to be able to act as a stacked filesystem. This
allows pure passthrough, where the fuse file system simply reflects the lower
filesystem, and also allows optional pre and post filtering in BPF and/or the
userspace daemon as needed. This can dramatically reduce or even eliminate
transitions to and from userspace.

Currently, we either set the backing file/bpf at mount time at the root level,
or at lookup time, via an optional block added at the end of the lookup return
call. The added lookup block contains an fd for the backing file/folder and bpf
if necessary, or a signal to clear or inherit the parent values. We're looking
into two options for extending this to mkdir/mknod/etc, as we currently only
support setting the backing to a pre-existing file, although naturally you can
create new ones. When we're doing a lookup for create, we could pass an
fd for the parent dir and the name of the backing file we're creating. This has
the benefit of avoiding an additional call to userspace, but requires hanging
on to some data in a negative dentry where there is no elegant place to store it.
Another option is adding the same block we added to lookup to the create type
op codes. This keeps that code more uniform, but means userspace must implement
that logic in more areas.

As is, the patches definitely need some work before they're ready. We still
need to go through and ensure we respect changed filter values/disallow changes
that don't make sense. We aren't currently calling mnt_want_write for the lower
calls where appropriate, and we don't have an override_creds layer either. We
also plan to add to our read/write iter filters to allow for more interesting
use cases. There are also probably some node id inconsistencies. For nodes that
will be completely passthrough, we give an id of 0.

For the BPF verification side, we have currently set things set up in the old
style, with a new bpf program type and helper functions. From LPC, my
understanding is that newer bpf additions are done in a new style, so I imagine
much of that will need to be redone as well, but hopefully these patches get
across what our needs there are.

For testing, we've provided the selftest code we have been using. We also have
a mode to run with no userspace daemon in a pure passthrough mode that I have
been running xfstests over to get some coverage on the backing operation code.
I had to modify mounts/unmounts to get that running, along with some other
small touch ups. The most notable failure I currently see there is in
generic/126, which I suspect is likely related to override_creds.


Alessio Balsini (1):
  fs: Generic function to convert iocb to rw flags

Daniel Rosenberg (25):
  bpf: verifier: Allow for multiple packets
  bpf: verifier: Allow single packet invalidation
  fuse-bpf: Update uapi for fuse-bpf
  fuse-bpf: Add BPF supporting functions
  bpf: Export bpf_prog_fops
  fuse-bpf: Prepare for fuse-bpf patch
  fuse: Add fuse-bpf, a stacked fs extension for FUSE
  fuse-bpf: Don't support export_operations
  fuse-bpf: Partially add mapping support
  fuse-bpf: Add lseek support
  fuse-bpf: Add support for fallocate
  fuse-bpf: Support file/dir open/close
  fuse-bpf: Support mknod/unlink/mkdir/rmdir
  fuse-bpf: Add support for read/write iter
  fuse-bpf: support FUSE_READDIR
  fuse-bpf: Add support for sync operations
  fuse-bpf: Add Rename support
  fuse-bpf: Add attr support
  fuse-bpf: Add support for FUSE_COPY_FILE_RANGE
  fuse-bpf: Add xattr support
  fuse-bpf: Add symlink/link support
  fuse-bpf: allow mounting with no userspace daemon
  fuse-bpf: Call bpf for pre/post filters
  fuse-bpf: Add userspace pre/post filters
  fuse-bpf: Add selftests

 fs/fuse/Kconfig                               |   10 +
 fs/fuse/Makefile                              |    1 +
 fs/fuse/backing.c                             | 2753 +++++++++++++++++
 fs/fuse/control.c                             |    2 +-
 fs/fuse/dev.c                                 |   33 +-
 fs/fuse/dir.c                                 |  443 ++-
 fs/fuse/file.c                                |  125 +-
 fs/fuse/fuse_i.h                              |  804 ++++-
 fs/fuse/inode.c                               |  292 +-
 fs/fuse/ioctl.c                               |    2 +-
 fs/fuse/readdir.c                             |   22 +
 fs/fuse/xattr.c                               |   36 +
 fs/overlayfs/file.c                           |   23 +-
 include/linux/bpf.h                           |    4 +
 include/linux/bpf_fuse.h                      |   64 +
 include/linux/bpf_types.h                     |    4 +
 include/linux/bpf_verifier.h                  |    5 +-
 include/linux/fs.h                            |    5 +
 include/uapi/linux/bpf.h                      |   33 +
 include/uapi/linux/fuse.h                     |   19 +-
 kernel/bpf/Makefile                           |    4 +
 kernel/bpf/bpf_fuse.c                         |  342 ++
 kernel/bpf/btf.c                              |    1 +
 kernel/bpf/core.c                             |    5 +
 kernel/bpf/syscall.c                          |    1 +
 kernel/bpf/verifier.c                         |  144 +-
 tools/include/uapi/linux/bpf.h                |   33 +
 tools/include/uapi/linux/fuse.h               | 1066 +++++++
 .../selftests/filesystems/fuse/.gitignore     |    2 +
 .../selftests/filesystems/fuse/Makefile       |   41 +
 .../testing/selftests/filesystems/fuse/OWNERS |    2 +
 .../selftests/filesystems/fuse/bpf_loader.c   |  798 +++++
 .../testing/selftests/filesystems/fuse/fd.txt |   21 +
 .../selftests/filesystems/fuse/fd_bpf.c       |  370 +++
 .../selftests/filesystems/fuse/fuse_daemon.c  |  294 ++
 .../selftests/filesystems/fuse/fuse_test.c    | 2147 +++++++++++++
 .../selftests/filesystems/fuse/test_bpf.c     |  800 +++++
 .../filesystems/fuse/test_framework.h         |  173 ++
 .../selftests/filesystems/fuse/test_fuse.h    |  328 ++
 39 files changed, 11017 insertions(+), 235 deletions(-)
 create mode 100644 fs/fuse/backing.c
 create mode 100644 include/linux/bpf_fuse.h
 create mode 100644 kernel/bpf/bpf_fuse.c
 create mode 100644 tools/include/uapi/linux/fuse.h
 create mode 100644 tools/testing/selftests/filesystems/fuse/.gitignore
 create mode 100644 tools/testing/selftests/filesystems/fuse/Makefile
 create mode 100644 tools/testing/selftests/filesystems/fuse/OWNERS
 create mode 100644 tools/testing/selftests/filesystems/fuse/bpf_loader.c
 create mode 100644 tools/testing/selftests/filesystems/fuse/fd.txt
 create mode 100644 tools/testing/selftests/filesystems/fuse/fd_bpf.c
 create mode 100644 tools/testing/selftests/filesystems/fuse/fuse_daemon.c
 create mode 100644 tools/testing/selftests/filesystems/fuse/fuse_test.c
 create mode 100644 tools/testing/selftests/filesystems/fuse/test_bpf.c
 create mode 100644 tools/testing/selftests/filesystems/fuse/test_framework.h
 create mode 100644 tools/testing/selftests/filesystems/fuse/test_fuse.h


base-commit: bf682942cd26ce9cd5e87f73ae099b383041e782

Martin KaFai Lau Sept. 28, 2022, 6:41 a.m. UTC | #1

On 9/26/22 4:17 PM, Daniel Rosenberg wrote:
> These patches extend FUSE to be able to act as a stacked filesystem. This
> allows pure passthrough, where the fuse file system simply reflects the lower
> filesystem, and also allows optional pre and post filtering in BPF and/or the
> userspace daemon as needed. This can dramatically reduce or even eliminate
> transitions to and from userspace.
> 
> Currently, we either set the backing file/bpf at mount time at the root level,
> or at lookup time, via an optional block added at the end of the lookup return
> call. The added lookup block contains an fd for the backing file/folder and bpf
> if necessary, or a signal to clear or inherit the parent values. We're looking
> into two options for extending this to mkdir/mknod/etc, as we currently only
> support setting the backing to a pre-existing file, although naturally you can
> create new ones. When we're doing a lookup for create, we could pass an
> fd for the parent dir and the name of the backing file we're creating. This has
> the benefit of avoiding an additional call to userspace, but requires hanging
> on to some data in a negative dentry where there is no elegant place to store it.
> Another option is adding the same block we added to lookup to the create type
> op codes. This keeps that code more uniform, but means userspace must implement
> that logic in more areas.
> 
> As is, the patches definitely need some work before they're ready. We still
> need to go through and ensure we respect changed filter values/disallow changes
> that don't make sense. We aren't currently calling mnt_want_write for the lower
> calls where appropriate, and we don't have an override_creds layer either. We
> also plan to add to our read/write iter filters to allow for more interesting
> use cases. There are also probably some node id inconsistencies. For nodes that
> will be completely passthrough, we give an id of 0.
> 
> For the BPF verification side, we have currently set things set up in the old
> style, with a new bpf program type and helper functions. From LPC, my
> understanding is that newer bpf additions are done in a new style, so I imagine
> much of that will need to be redone as well, but hopefully these patches get
> across what our needs there are.
> 
> For testing, we've provided the selftest code we have been using. We also have
> a mode to run with no userspace daemon in a pure passthrough mode that I have
> been running xfstests over to get some coverage on the backing operation code.
> I had to modify mounts/unmounts to get that running, along with some other
> small touch ups. The most notable failure I currently see there is in
> generic/126, which I suspect is likely related to override_creds.
> 

Interesting idea.

Some comments on review logistics:
- The set is too long and some of the individual patches are way too long for 
one single patch to review.  Keep in mind that not all of us here are experts in 
both fuse and bpf.  Making it easier to review first will help at the beginning. 
  Some ideas:

   - Only implement a few ops in the initial revision. From quickly browsing the 
set, it is implementing the 'struct file_operations fuse_file_operations'? 
Maybe the first few revisions can start with a few of the ops first.

   - Please make the patches that can be applied to the bpf-next tree cleanly. 
For example, in patch 3, where is 18e2ec5bf453 coming from? I cannot find it in 
bpf-next and linux-next tree.
   - Without applying it to an upstream tree cleanly, in a big set like this, I 
have no idea when bpf_prog_run() is called in patch 24 because the diff context 
is in fuse_bpf_cleanup and apparently it is not where the bpf prog is run.

Some high level comments on the set:
- Instead of adding bpf helpers, you should consider kfunc instead. You can take 
a look at the recent HID patchset v10 or the recent nf conntrack bpf set.

- Instead of expressing as packet data, using the recent dynptr is a better way 
to go for handling a mem blob.

- iiuc, the idea is to allow bpf prog to optionally handle the 'struct 
file_operations' without going back to the user daemon? Have you looked at 
struct_ops which seems to be a better fit here?  If the bpf prog does not know 
how to handle an operation (or file?), it can call fuse_file_llseek (for 
example) as a kfunc to handle the request.

- The test SEC("test_trace") seems mostly a synthetic test for checking 
correctness.  Does it have a test that shows a more real life use case? or I 
have missed things in patch 26?

- Please use the skel to load the program.  It is pretty hard to read the loader 
in patch 26.

- I assume the main objective is for performance by not going back to the user 
daemon?  Do you have performance number?

Brian Foster Sept. 28, 2022, 12:31 p.m. UTC | #2

On Tue, Sep 27, 2022 at 11:41:50PM -0700, Martin KaFai Lau wrote:
> On 9/26/22 4:17 PM, Daniel Rosenberg wrote:
> > These patches extend FUSE to be able to act as a stacked filesystem. This
> > allows pure passthrough, where the fuse file system simply reflects the lower
> > filesystem, and also allows optional pre and post filtering in BPF and/or the
> > userspace daemon as needed. This can dramatically reduce or even eliminate
> > transitions to and from userspace.
> > 
> > Currently, we either set the backing file/bpf at mount time at the root level,
> > or at lookup time, via an optional block added at the end of the lookup return
> > call. The added lookup block contains an fd for the backing file/folder and bpf
> > if necessary, or a signal to clear or inherit the parent values. We're looking
> > into two options for extending this to mkdir/mknod/etc, as we currently only
> > support setting the backing to a pre-existing file, although naturally you can
> > create new ones. When we're doing a lookup for create, we could pass an
> > fd for the parent dir and the name of the backing file we're creating. This has
> > the benefit of avoiding an additional call to userspace, but requires hanging
> > on to some data in a negative dentry where there is no elegant place to store it.
> > Another option is adding the same block we added to lookup to the create type
> > op codes. This keeps that code more uniform, but means userspace must implement
> > that logic in more areas.
> > 
> > As is, the patches definitely need some work before they're ready. We still
> > need to go through and ensure we respect changed filter values/disallow changes
> > that don't make sense. We aren't currently calling mnt_want_write for the lower
> > calls where appropriate, and we don't have an override_creds layer either. We
> > also plan to add to our read/write iter filters to allow for more interesting
> > use cases. There are also probably some node id inconsistencies. For nodes that
> > will be completely passthrough, we give an id of 0.
> > 
> > For the BPF verification side, we have currently set things set up in the old
> > style, with a new bpf program type and helper functions. From LPC, my
> > understanding is that newer bpf additions are done in a new style, so I imagine
> > much of that will need to be redone as well, but hopefully these patches get
> > across what our needs there are.
> > 
> > For testing, we've provided the selftest code we have been using. We also have
> > a mode to run with no userspace daemon in a pure passthrough mode that I have
> > been running xfstests over to get some coverage on the backing operation code.
> > I had to modify mounts/unmounts to get that running, along with some other
> > small touch ups. The most notable failure I currently see there is in
> > generic/126, which I suspect is likely related to override_creds.
> > 
> 
> Interesting idea.
> 
> Some comments on review logistics:
> - The set is too long and some of the individual patches are way too long
> for one single patch to review.  Keep in mind that not all of us here are
> experts in both fuse and bpf.  Making it easier to review first will help at
> the beginning.  Some ideas:
> 
>   - Only implement a few ops in the initial revision. From quickly browsing
> the set, it is implementing the 'struct file_operations
> fuse_file_operations'? Maybe the first few revisions can start with a few of
> the ops first.
> 

I had a similar thought when poking through this. A related question I
had is how much of a functional dependency does the core passthrough
mechanism have on bpf? If bpf is optional for filtering purposes and
isn't absolutely necessary to set up a basic form of passthrough, I
think review would be made easier by splitting off those core bits from
the bpf components so each part is easier to review by people who know
them best. For example, introduce all the fuse enhancements, hooks and
cleanups to set up a passthrough to start the series, then plumb in the
bpf filtering magic on top. Hm?

FWIW, if this is an RFC/prototype and you want more efficient review
cycles, another idea to take that a step further could be to start with
read-only support (or maybe even just directory walking?).

BTW if the bpf bits are optional, how might one mount a fuse/no
daemon/passthrough filesystem from userspace? Is that possible with this
series as is?

Something more on the fuse side.. it looks like we introduce a pattern
where bits of generic request completion processing can end up
duplicated between the shortcut (i.e.  _backing()/_finalize()) handlers
and the traditional post request code, because the shortcuts basically
bypass the entire rest of the codepath. For example, something like
create_new_entry() is currently reused for several inode creation
operations. With passthrough mode, it looks like some of that code (i.e.
vfs dentry fixups) is split off from create_new_entry() into each
individual backing mode handler.

It looks like much of the lower level request processing code was
refactored into the fuse_iqueue to support things like virtiofs. Have
you looked into whether that abstraction can be reused or enhanced to
support bpf filtering, direct passthrough calls, etc.? Or perhaps
whether more of the higher level code could be refactored in a similar
way to encourage more reuse and avoid branching off every fs operation
into a special passthrough codepath?

Brian

>   - Please make the patches that can be applied to the bpf-next tree
> cleanly. For example, in patch 3, where is 18e2ec5bf453 coming from? I
> cannot find it in bpf-next and linux-next tree.
>   - Without applying it to an upstream tree cleanly, in a big set like this,
> I have no idea when bpf_prog_run() is called in patch 24 because the diff
> context is in fuse_bpf_cleanup and apparently it is not where the bpf prog
> is run.
> 
> Some high level comments on the set:
> - Instead of adding bpf helpers, you should consider kfunc instead. You can
> take a look at the recent HID patchset v10 or the recent nf conntrack bpf
> set.
> 
> - Instead of expressing as packet data, using the recent dynptr is a better
> way to go for handling a mem blob.
> 
> - iiuc, the idea is to allow bpf prog to optionally handle the 'struct
> file_operations' without going back to the user daemon? Have you looked at
> struct_ops which seems to be a better fit here?  If the bpf prog does not
> know how to handle an operation (or file?), it can call fuse_file_llseek
> (for example) as a kfunc to handle the request.
> 
> - The test SEC("test_trace") seems mostly a synthetic test for checking
> correctness.  Does it have a test that shows a more real life use case? or I
> have missed things in patch 26?
> 
> - Please use the skel to load the program.  It is pretty hard to read the
> loader in patch 26.
> 
> - I assume the main objective is for performance by not going back to the
> user daemon?  Do you have performance number?
>

Daniel Rosenberg Oct. 1, 2022, 12:05 a.m. UTC | #3

On Tue, Sep 27, 2022 at 11:41 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> Interesting idea.
>
> Some comments on review logistics:
> - The set is too long and some of the individual patches are way too long for
> one single patch to review.  Keep in mind that not all of us here are experts in
> both fuse and bpf.  Making it easier to review first will help at the beginning.
>   Some ideas:
>
>    - Only implement a few ops in the initial revision. From quickly browsing the
> set, it is implementing the 'struct file_operations fuse_file_operations'?
> Maybe the first few revisions can start with a few of the ops first.
>

I've split it up a fair bit already, do you mean just sending a subset
of them at a time? I think the current splitting roughly allows for
that. Patch 1-4 and 5 deal with bpf/verifier code which isn't used
until patch 24. I can reorder/split up the opcodes arbitrarily.
Putting the op codes that implement file passthrough first makes
sense. The code is much easier to test when all/most are present,
since then I can just use patch 23 to mount without a daemon and run
xfs tests on them. At least initially I felt the whole stack was
useful to give the full picture.

>    - Please make the patches that can be applied to the bpf-next tree cleanly.
> For example, in patch 3, where is 18e2ec5bf453 coming from? I cannot find it in
> bpf-next and linux-next tree.
>    - Without applying it to an upstream tree cleanly, in a big set like this, I
> have no idea when bpf_prog_run() is called in patch 24 because the diff context
> is in fuse_bpf_cleanup and apparently it is not where the bpf prog is run.
>

Currently this is based off of
bf682942cd26ce9cd5e87f73ae099b383041e782 in
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
I would have rebased on top of bpf-next, except that from my
conversations at plumbers, I figured that the set up would need to
change significantly, and that effort would be wasted. My goal
including them here was to give more of a sense of what our needs are,
and be a starting point for working out what we really ought to be
using.

> Some high level comments on the set:
> - Instead of adding bpf helpers, you should consider kfunc instead. You can take
> a look at the recent HID patchset v10 or the recent nf conntrack bpf set.
>
> - Instead of expressing as packet data, using the recent dynptr is a better way
> to go for handling a mem blob.
>

I'll look into those, I remember them coming up at LPC. My current use
of packets/buffers does seem to abuse their intended meaning a bit.

> - iiuc, the idea is to allow bpf prog to optionally handle the 'struct
> file_operations' without going back to the user daemon? Have you looked at
> struct_ops which seems to be a better fit here?  If the bpf prog does not know
> how to handle an operation (or file?), it can call fuse_file_llseek (for
> example) as a kfunc to handle the request.
>

I wasn't aware of struct_ops. It looks like that may work for us
instead of making a new prog type. I'll definitely look into that.
I'll likely sign up for the bpf office hours next week.

> - The test SEC("test_trace") seems mostly a synthetic test for checking
> correctness.  Does it have a test that shows a more real life use case? or I
> have missed things in patch 26?
>

Patch 26 is pretty much all synthetic tests. A lot of them are just
ensuring that we even call in to the bpf program, and some limited
testing that changing the filters has the expected results. We
mentioned a few more concrete usecases in the LPC talk. One of those
is folder hiding. In Android we've had an issue with leaking the
existence of some apps to other apps. We needed to hide certain
directories from apps where they do have permissions to traverse the
directory. Attempting to access a folder without permissions would
result in EPERM, revealing the existence of that folder. Since the
application doesn't have permission to create arbitrary folders at
that level, we can hide it by using fuse-bpf to change the EPERM into
an ENOENT, and then filter readdir to remove disallowed entries. You
can see something like that in bpf_test_redact_readdir. We also have
some file level redaction. If an app doesn't have location
permissions, but does have file permissions, they could just read
picture metadata to get location information. We could have bpf
redirect reads that might contain location data to the daemon, while
passing through other parts.

> - Please use the skel to load the program.  It is pretty hard to read the loader
> in patch 26.

Yeah, patch 26 is not in great shape currently. I included it mostly
as something that exercises the code, and contains some example bpf
programs. Any suggestions on setting up the tests better are
appreciated.

>
> - I assume the main objective is for performance by not going back to the user
> daemon?  Do you have performance number?
>

I don't have any on hand from the current version. It's a little
tricky to know what numbers are relevant here since the numbers will
change greatly depending on what you do with it. In pure passthrough
without a bpf program, we were seeing performance pretty comparable to
the lower filesystem. Depending on how large of a bpf program we use
we were seeing pretty different slowdowns from that, though at least
some of that was from having a large switch statement.

Do you have any suggestions on what to test here? My thoughts would be
comparing lower fs performance to pure passthrough, to maybe the
example ll_passthrough in libfuse, but that doesn't really show any
bpf impact.

-Daniel

Alexei Starovoitov Oct. 1, 2022, 12:24 a.m. UTC | #4

On Fri, Sep 30, 2022 at 5:05 PM Daniel Rosenberg <drosen@google.com> wrote:
>
> >    - Please make the patches that can be applied to the bpf-next tree cleanly.
> > For example, in patch 3, where is 18e2ec5bf453 coming from? I cannot find it in
> > bpf-next and linux-next tree.
> >    - Without applying it to an upstream tree cleanly, in a big set like this, I
> > have no idea when bpf_prog_run() is called in patch 24 because the diff context
> > is in fuse_bpf_cleanup and apparently it is not where the bpf prog is run.
> >
>
> Currently this is based off of
> bf682942cd26ce9cd5e87f73ae099b383041e782 in
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> I would have rebased on top of bpf-next, except that from my
> conversations at plumbers, I figured that the set up would need to
> change significantly, and that effort would be wasted. My goal
> including them here was to give more of a sense of what our needs are,
> and be a starting point for working out what we really ought to be
> using.

It was a good idea to send it early :)

>
> > Some high level comments on the set:
> > - Instead of adding bpf helpers, you should consider kfunc instead. You can take
> > a look at the recent HID patchset v10 or the recent nf conntrack bpf set.
> >
> > - Instead of expressing as packet data, using the recent dynptr is a better way
> > to go for handling a mem blob.
> >
>
> I'll look into those, I remember them coming up at LPC. My current use
> of packets/buffers does seem to abuse their intended meaning a bit.

This 'abuse' is sorta, kinda, ok-ish. We can accept that,
but once you convert to kfunc interface you might realize that
"packet" abstraction is not necessary here and there are
cleaner alternatives. Have you looked at dynptr ?

> > - iiuc, the idea is to allow bpf prog to optionally handle the 'struct
> > file_operations' without going back to the user daemon? Have you looked at
> > struct_ops which seems to be a better fit here?  If the bpf prog does not know
> > how to handle an operation (or file?), it can call fuse_file_llseek (for
> > example) as a kfunc to handle the request.
> >
>
> I wasn't aware of struct_ops. It looks like that may work for us
> instead of making a new prog type. I'll definitely look into that.
> I'll likely sign up for the bpf office hours next week.

I have to second everything that Martin suggested.

To reiterate his points in different words:
. patch 26 with printk debug only gives very low
confidence that the presented api towards bpf programs
will be usable.
The patch series gotta have a production worthy bpf program
that actually does things you want it to do.

. please use kfunc mechanism similar to the way hid-bpf is doing.
If individual funcs are not enough and you need to attach
a set of bpf programs all at once then use struct_ops.
kfuncs are prefered if you don't need atomicity of a set of progs.
If it's fine to attach progs one at a time to different nop==empty
functions than just use a set of nop funcs and call back into
the kernel with kfuncs.
That would be easier to rip out when api turns out to
be insufficient or extensions are necessary.

> > - The test SEC("test_trace") seems mostly a synthetic test for checking
> > correctness.  Does it have a test that shows a more real life use case? or I
> > have missed things in patch 26?
> >
>
> Patch 26 is pretty much all synthetic tests. A lot of them are just
> ensuring that we even call in to the bpf program, and some limited
> testing that changing the filters has the expected results. We
> mentioned a few more concrete usecases in the LPC talk. One of those
> is folder hiding. In Android we've had an issue with leaking the
> existence of some apps to other apps. We needed to hide certain
> directories from apps where they do have permissions to traverse the
> directory. Attempting to access a folder without permissions would
> result in EPERM, revealing the existence of that folder. Since the
> application doesn't have permission to create arbitrary folders at
> that level, we can hide it by using fuse-bpf to change the EPERM into
> an ENOENT, and then filter readdir to remove disallowed entries. You
> can see something like that in bpf_test_redact_readdir. We also have
> some file level redaction. If an app doesn't have location
> permissions, but does have file permissions, they could just read
> picture metadata to get location information. We could have bpf
> redirect reads that might contain location data to the daemon, while
> passing through other parts.

The described use cases sound useful.
Just present them as real bpf progs.

My main concern, though, is with patches 7-25.
I don't understand file systems, but it looks like the patches
bring features into fuse from other file systems.
In many ways it looks like overlayfs.
Maybe just add bpf hooks to overlayfs?
Why fuse at all?
It might look like a bunch of nop functions sprinkled around overlayfs
code.
If you need to talk to user space from bpf prog you'll have
ringbuf and user_ringbuf to stream data from bpf prog to user
and back.

Daniel Rosenberg Oct. 1, 2022, 12:47 a.m. UTC | #5

On Wed, Sep 28, 2022 at 5:31 AM Brian Foster <bfoster@redhat.com> wrote:
>
> I had a similar thought when poking through this. A related question I
> had is how much of a functional dependency does the core passthrough
> mechanism have on bpf? If bpf is optional for filtering purposes and
> isn't absolutely necessary to set up a basic form of passthrough, I
> think review would be made easier by splitting off those core bits from
> the bpf components so each part is easier to review by people who know
> them best. For example, introduce all the fuse enhancements, hooks and
> cleanups to set up a passthrough to start the series, then plumb in the
> bpf filtering magic on top. Hm?
>

The passthrough code has no dependency on the bpf functionality. I can
reorder these patches to not have any bpf changes until patch 24. I'll
probably change the order like I described in my previous email. The
patches do become a lot more useful once the pre/post filters enter
the mix though.

> BTW if the bpf bits are optional, how might one mount a fuse/no
> daemon/passthrough filesystem from userspace? Is that possible with this
> series as is?
>
This is provided by patch 23. You can mount with the "no_daemon"
option. Anywhere FUSE attempts to call the daemon will end up with an
error, since the daemon is not connected. If you pair this with
"root_dir=[fd]" and optionally "root_bpf=[fd]", you can run in a
daemon-less passthrough mode. It's a bit less exciting though, since
at that point you're kind of doing a bind mount with extra steps.
Useful for testing though, and in theory you may be able to implement
most of a daemon in bpf.

> Something more on the fuse side.. it looks like we introduce a pattern
> where bits of generic request completion processing can end up
> duplicated between the shortcut (i.e.  _backing()/_finalize()) handlers
> and the traditional post request code, because the shortcuts basically
> bypass the entire rest of the codepath. For example, something like
> create_new_entry() is currently reused for several inode creation
> operations. With passthrough mode, it looks like some of that code (i.e.
> vfs dentry fixups) is split off from create_new_entry() into each
> individual backing mode handler.
>
> It looks like much of the lower level request processing code was
> refactored into the fuse_iqueue to support things like virtiofs. Have
> you looked into whether that abstraction can be reused or enhanced to
> support bpf filtering, direct passthrough calls, etc.? Or perhaps
> whether more of the higher level code could be refactored in a similar
> way to encourage more reuse and avoid branching off every fs operation
> into a special passthrough codepath?
>
> Brian
>

The largest opportunity for reducing duplicate code would probably be
trying to unify the backing calls between overlayfs and our work here.
In places where you need to do more work than directly calling the
relevant vfs calls we probably could factor out some common helpers. I
haven't looked too much into that yet since I want to see where the
fuse-bpf code ends up before I try to commit to that. I've thought
about unifying some of the code around node creation in the backing
implementations, but haven't gotten around to it yet. We definitely
need to branch off for every operation though, since fuse otherwise
has no concept of the backing filesystem. We do have some more work to
do to ensure there is a clean handoff between regular fuse and
fuse-bpf. The goal is to be able to handle just the parts you need to
in the daemon, while the rest can be passed through if you're acting
as a stacked filesystem. There are some oddities around things fuse
does for efficiency that fuse-bpf doesn't need to do. For instance, if
you're using passthrough for getattr, you don't really need to do a
readdir_plus, since you don't have to worry about all the extra daemon
requests.

-Daniel

Martin KaFai Lau Oct. 6, 2022, 1:58 a.m. UTC | #6

On 9/30/22 5:05 PM, Daniel Rosenberg wrote:
> On Tue, Sep 27, 2022 at 11:41 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> Interesting idea.
>>
>> Some comments on review logistics:
>> - The set is too long and some of the individual patches are way too long for
>> one single patch to review.  Keep in mind that not all of us here are experts in
>> both fuse and bpf.  Making it easier to review first will help at the beginning.
>>    Some ideas:
>>
>>     - Only implement a few ops in the initial revision. From quickly browsing the
>> set, it is implementing the 'struct file_operations fuse_file_operations'?
>> Maybe the first few revisions can start with a few of the ops first.
>>
> 
> I've split it up a fair bit already, do you mean just sending a subset
> of them at a time? I think the current splitting roughly allows for
> that. Patch 1-4 and 5 deal with bpf/verifier code which isn't used
> until patch 24. I can reorder/split up the opcodes arbitrarily.

> Putting the op codes that implement file passthrough first makes
> sense. The code is much easier to test when all/most are present,
> since then I can just use patch 23 to mount without a daemon and run
> xfs tests on them. At least initially I felt the whole stack was
> useful to give the full picture.

I don't mind to have all op codes in each re-spin as long as it can apply 
cleanly to bpf-next where the bpf implementation part will eventually land. 
Patch 26 has to split up though.  It is a few thousand lines in one patch.

I was just thinking to only do a few op codes, eg. the few android use cases you 
have mentioned.  My feeling is other op codes should not be very different in 
term of the bpf side implementation (or it is not true?).  When the patch set 
getting enough traction, then start adding more op codes in the later revisions. 
  That will likely help to re-spin faster and save you time also.

>> - iiuc, the idea is to allow bpf prog to optionally handle the 'struct
>> file_operations' without going back to the user daemon? Have you looked at
>> struct_ops which seems to be a better fit here?  If the bpf prog does not know
>> how to handle an operation (or file?), it can call fuse_file_llseek (for
>> example) as a kfunc to handle the request.
>>
> 
> I wasn't aware of struct_ops. It looks like that may work for us
> instead of making a new prog type. I'll definitely look into that.
> I'll likely sign up for the bpf office hours next week.

You can take a look at the tools/testing/selftests/bpf/progs/bpf_cubic.c.
It implements the whole tcp congestion in bpf. In particular, the bpf prog is 
implementing the kernel 'struct tcp_congestion_ops'.  That selftest example is 
pretty much a direct copy from the kernel net/ipv4/tcp_cubic.c.  Also, in 
BPF_STRUCT_OPS(bpf_cubic_undo_cwnd, ...), it is directly calling the kfunc's 
tcp_reno_undo_cwnd() when the bpf prog does not need to do anything different 
from the kernel's tcp_reno_undo_cwnd().  Look at how it is marked as __ksym in 
bpf_cubic.c

However, echoing Alexei's earlier reply, struct_ops is good when it needs to 
implement a well defined 'struct xyz_operations' that has all function pointer 
in it.  Taking another skim at the set, it seems like it is mostly trying to 
intercept the fuse_simple_request() call?

[00/26] FUSE BPF: A Stacked Filesystem Extension for FUSE

Message

Comments