
[v5,00/18] fanotify: add pre-content hooks

Message ID cover.1725481503.git.josef@toxicpanda.com (mailing list archive)

Message

Josef Bacik Sept. 4, 2024, 8:27 p.m. UTC
v4: https://lore.kernel.org/linux-fsdevel/cover.1723670362.git.josef@toxicpanda.com/
v3: https://lore.kernel.org/linux-fsdevel/cover.1723228772.git.josef@toxicpanda.com/
v2: https://lore.kernel.org/linux-fsdevel/cover.1723144881.git.josef@toxicpanda.com/
v1: https://lore.kernel.org/linux-fsdevel/cover.1721931241.git.josef@toxicpanda.com/

v4->v5:
- Cleaned up the various "I'll fix it on commit" notes that Jan made since I had
  to respin the series anyway.
- Renamed the filemap pagefault helper for fsnotify per Christian's suggestion.
- Added a FS_ALLOW_HSM flag per Jan's comments, based on Amir's rough sketch.
- Added a patch to disable btrfs defrag on pre-content watched files.
- Added a patch to turn on FS_ALLOW_HSM for all the file systems that I tested.
- Added two fstests (which will be posted separately) to validate everything,
  re-validated the series with btrfs, xfs, ext4, and bcachefs to make sure I
  didn't break anything.

v3->v4:
- Trying to send a final version Friday at 5pm before you go on vacation is a
  recipe for silly mistakes; fixed the xfs handling yet again, per Christoph's
  review.
- Reworked the file system helper so its handling of fpin is a little less
  silly, per Chinner's suggestion.
- Updated the return values to not OR in VM_FAULT_RETRY, as we have a comment
  in filemap_fault that says if VM_FAULT_ERROR is set we won't have
  VM_FAULT_RETRY set.

v2->v3:
- Fix the pagefault path to do MAY_ACCESS instead, updated the perm handler to
  emit PRE_ACCESS in this case, so we can avoid the extraneous perm event as per
  Amir's suggestion.
- Reworked the exported helper so the per-filesystem changes are much smaller,
  per Amir's suggestion.
- Fixed the screwup for DAX writes per Chinner's suggestion.
- Added Christian's reviewed-by's where appropriate.

v1->v2:
- reworked the page fault logic based on Jan's suggestion and turned it into a
  helper.
- Added 3 patches per-fs where we need to call the fsnotify helper from their
  ->fault handlers.
- Disabled readahead in the case that there's a pre-content watch in place.
- Disabled huge faults when there's a pre-content watch in place (entirely
  because it's untested, theoretically it should be straightforward to do).
- Updated the command numbers.
- Addressed the random spelling/grammar mistakes that Jan pointed out.
- Addressed the other random nits from Jan.

--- Original email ---

Hello,

These are the patches for the bare bones pre-content fanotify support.  The
majority of this work is Amir's, my contribution to this has solely been around
adding the page fault hooks, testing and validating everything.  I'm sending it
because Amir is traveling a bunch, and I touched it last so I'm going to take
all the hate and he can take all the credit.

There is a PoC that I've been using to validate this work, you can find the git
repo here

https://github.com/josefbacik/remote-fetch

This consists of 3 different tools.

1. populate.  This just creates all the stub files in the directory from the
   source directory.  Just run ./populate ~/linux ~/hsm-linux and it'll
   recursively create all of the stub files and directories.
2. remote-fetch.  This is the actual PoC, you just point it at the source and
   destination directory and then you can do whatever.  ./remote-fetch ~/linux
   ~/hsm-linux.
3. mmap-validate.  This was to validate the pagefault thing, this is likely what
   will be turned into the selftest with remote-fetch.  It creates a file and
   then you can validate the file matches the right pattern with both normal
   reads and mmap.  Normally I do something like

   ./mmap-validate create ~/src/foo
   ./populate ~/src ~/dst
   ./remote-fetch ~/src ~/dst
   ./mmap-validate validate ~/dst/foo
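
For readers who have not looked at the repo, the idea behind mmap-validate
can be sketched roughly as below. This is not the code from remote-fetch: the
fill pattern, the function names and the chunk size are all made up for
illustration. It only shows the principle of writing a known pattern and then
checking it once through read() and once through a mapping, so that the page
fault path gets exercised as well.

```python
import mmap
import os

# Hypothetical fill pattern; the real tool may use something different.
PATTERN = b"0123456789abcdef"


def create_pattern_file(path, size):
    """Write `size` bytes of the repeating PATTERN to `path`."""
    reps = size // len(PATTERN) + 1
    with open(path, "wb") as f:
        f.write((PATTERN * reps)[:size])


def expected_bytes(offset, length):
    """Return the bytes the pattern should hold at [offset, offset + length)."""
    start = offset % len(PATTERN)
    reps = (start + length) // len(PATTERN) + 1
    return (PATTERN * reps)[start:start + length]


def validate_read(path, chunk=4096):
    """Check the file contents using ordinary read() calls."""
    offset = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                return True
            if data != expected_bytes(offset, len(data)):
                return False
            offset += len(data)


def validate_mmap(path, chunk=4096):
    """Check the file contents through a read-only mapping (page fault path)."""
    size = os.path.getsize(path)
    if size == 0:
        return True  # an empty file cannot be mapped
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        for offset in range(0, size, chunk):
            data = m[offset:offset + chunk]
            if data != expected_bytes(offset, len(data)):
                return False
    return True
```

With something like this, the "create" step would call create_pattern_file()
on the source side, and the "validate" step would run both validate_read() and
validate_mmap() on the destination file.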

I did a bunch of testing and also got some performance numbers.  I copied a
kernel tree, ran remote-fetch, and then ran make -j4:

Normal
real    9m49.709s
user    28m11.372s
sys     4m57.304s

HSM
real    10m6.454s
user    29m10.517s
sys     5m2.617s

So ~17 seconds more to build with HSM.  I then did a make mrproper on both trees
to see the size:

[root@fedora ~]# du -hs /src/linux
1.6G    /src/linux
[root@fedora ~]# du -hs dst
125M    dst

This mirrors the sort of savings we've seen in production.
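
As a quick sanity check of the numbers above, the arithmetic can be redone
with a few lines of Python. This is only a sketch: the helper names are made
up, and the du -h values are already rounded, so the space figure is
approximate.

```python
import re


def parse_time(text):
    """Convert a `time`-style value such as '9m49.709s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if m is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


def parse_du(text):
    """Convert a `du -h` value such as '1.6G' to MiB (du -h uses powers of 1024)."""
    units = {"K": 1 / 1024, "M": 1, "G": 1024}
    return float(text[:-1]) * units[text[-1]]


# Wall-clock ("real") times from the two builds quoted above.
normal = parse_time("9m49.709s")
hsm = parse_time("10m6.454s")
delta = hsm - normal
overhead_pct = 100 * delta / normal

# Tree sizes after make mrproper, as reported by du -hs.
full = parse_du("1.6G")
stub = parse_du("125M")
saved_pct = 100 * (1 - stub / full)

print(f"extra wall time: {delta:.3f}s ({overhead_pct:.1f}%)")
print(f"disk space saved: {saved_pct:.1f}%")
```

That works out to about 16.7 seconds, or a bit under 3% of extra wall time,
against roughly 92% less disk usage for the HSM tree.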

Meta has had these patches (minus the page fault patch) deployed in production
for almost a year with our own utility for doing on-demand package fetching.
The savings from this have been pretty significant.

The page-fault hooks are necessary for the last thing we need, which is
on-demand range fetching of executables.  Some of our binaries are several
gigabytes in size, so having the ability to remote fetch them on demand is a
huge win for us, not only in space savings but also in container startup time.
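
To make "range fetching" a bit more concrete, here is a rough sketch of what
a userspace daemon could do once it knows which byte range is about to be
accessed. How the daemon learns the range (the pre-content event and its
range info) is deliberately left out, and fetch_range() is a hypothetical
helper, not something from the patches or from remote-fetch. It also assumes
the stub is a sparse file of the same size as the source.

```python
import os


def fetch_range(src_fd, dst_fd, offset, count, chunk=1 << 20):
    """Copy bytes [offset, offset + count) from src_fd into the same range of dst_fd.

    Returns the number of bytes copied, which is short if the source ends early.
    """
    copied = 0
    while copied < count:
        want = min(chunk, count - copied)
        data = os.pread(src_fd, want, offset + copied)
        if not data:
            break  # reached the end of the source file
        written = 0
        while written < len(data):
            # pwrite may write fewer bytes than asked, so loop until done
            written += os.pwrite(dst_fd, data[written:],
                                 offset + copied + written)
        copied += len(data)
    return copied
```

The point is that only the requested range is filled in; the rest of the stub
stays unpopulated until something actually touches it.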

There will be tests for this going into LTP once we're satisfied with the
patches and they're on their way upstream.  Thanks,

Josef

Amir Goldstein (8):
  fsnotify: introduce pre-content permission event
  fsnotify: generate pre-content permission event on open
  fanotify: introduce FAN_PRE_ACCESS permission event
  fanotify: introduce FAN_PRE_MODIFY permission event
  fanotify: pass optional file access range in pre-content event
  fanotify: rename a misnamed constant
  fanotify: report file range info with pre-content events
  fanotify: allow to set errno in FAN_DENY permission response

Josef Bacik (10):
  fanotify: don't skip extra event info if no info_mode is set
  fs: add a flag to indicate the fs supports pre-content events
  fanotify: add a helper to check for pre content events
  fanotify: disable readahead if we have pre-content watches
  mm: don't allow huge faults for files with pre content watches
  fsnotify: generate pre-content permission event on page fault
  bcachefs: add pre-content fsnotify hook to fault
  xfs: add pre-content fsnotify hook for write faults
  btrfs: disable defrag on pre-content watched files
  fs: enable pre-content events on supported file systems

 fs/bcachefs/fs-io-pagecache.c      |   4 +
 fs/bcachefs/fs.c                   |   2 +-
 fs/btrfs/ioctl.c                   |   9 ++
 fs/btrfs/super.c                   |   3 +-
 fs/ext4/super.c                    |   6 +-
 fs/namei.c                         |   9 ++
 fs/notify/fanotify/fanotify.c      |  33 ++++++--
 fs/notify/fanotify/fanotify.h      |  15 ++++
 fs/notify/fanotify/fanotify_user.c | 119 ++++++++++++++++++++++-----
 fs/notify/fsnotify.c               |  17 +++-
 fs/xfs/xfs_file.c                  |   4 +
 fs/xfs/xfs_super.c                 |   2 +-
 include/linux/fanotify.h           |  20 +++--
 include/linux/fs.h                 |   1 +
 include/linux/fsnotify.h           |  58 +++++++++++--
 include/linux/fsnotify_backend.h   |  59 ++++++++++++-
 include/linux/mm.h                 |   1 +
 include/uapi/linux/fanotify.h      |  18 ++++
 mm/filemap.c                       | 128 +++++++++++++++++++++++++++--
 mm/memory.c                        |  22 +++++
 mm/readahead.c                     |  13 +++
 security/selinux/hooks.c           |   3 +-
 22 files changed, 489 insertions(+), 57 deletions(-)

Comments

Amir Goldstein Sept. 5, 2024, 8:33 a.m. UTC | #1
On Wed, Sep 4, 2024 at 10:29 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> v4: https://lore.kernel.org/linux-fsdevel/cover.1723670362.git.josef@toxicpanda.com/
> v3: https://lore.kernel.org/linux-fsdevel/cover.1723228772.git.josef@toxicpanda.com/
> v2: https://lore.kernel.org/linux-fsdevel/cover.1723144881.git.josef@toxicpanda.com/
> v1: https://lore.kernel.org/linux-fsdevel/cover.1721931241.git.josef@toxicpanda.com/
>
> v4->v5:
> - Cleaned up the various "I'll fix it on commit" notes that Jan made since I had
>   to respin the series anyway.
> - Renamed the filemap pagefault helper for fsnotify per Christian's suggestion.
> - Added a FS_ALLOW_HSM flag per Jan's comments, based on Amir's rough sketch.
> - Added a patch to disable btrfs defrag on pre-content watched files.
> - Added a patch to turn on FS_ALLOW_HSM for all the file systems that I tested.

My only nits are about the ordering of the FS_ALLOW_HSM patches.
I guess, as the merge window is closing in, Jan could do these trivial
reorders on commit, based on his preference (?).

> - Added two fstests (which will be posted separately) to validate everything,
>   re-validated the series with btrfs, xfs, ext4, and bcachefs to make sure I
>   didn't break anything.

Very cool!

Thanks again for the "productization" of my patches :)
Amir.
Jan Kara Sept. 5, 2024, 10:32 a.m. UTC | #2
On Thu 05-09-24 10:33:07, Amir Goldstein wrote:
> On Wed, Sep 4, 2024 at 10:29 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > v4: https://lore.kernel.org/linux-fsdevel/cover.1723670362.git.josef@toxicpanda.com/
> > v3: https://lore.kernel.org/linux-fsdevel/cover.1723228772.git.josef@toxicpanda.com/
> > v2: https://lore.kernel.org/linux-fsdevel/cover.1723144881.git.josef@toxicpanda.com/
> > v1: https://lore.kernel.org/linux-fsdevel/cover.1721931241.git.josef@toxicpanda.com/
> >
> > v4->v5:
> > - Cleaned up the various "I'll fix it on commit" notes that Jan made since I had
> >   to respin the series anyway.
> > - Renamed the filemap pagefault helper for fsnotify per Christian's suggestion.
> > - Added a FS_ALLOW_HSM flag per Jan's comments, based on Amir's rough sketch.
> > - Added a patch to disable btrfs defrag on pre-content watched files.
> > - Added a patch to turn on FS_ALLOW_HSM for all the file systems that I tested.
> 
> My only nits are about different ordering of the FS_ALLOW_HSM patches
> I guess as the merge window is closing in, Jan could do these trivial
> reorders on commit, based on his preference (?).

Yes, I can do the reordering on commit.

								Honza
Jan Kara Sept. 5, 2024, 12:08 p.m. UTC | #3
Hello!

On Wed 04-09-24 16:27:50, Josef Bacik wrote:
> These are the patches for the bare bones pre-content fanotify support.  The
> majority of this work is Amir's, my contribution to this has solely been around
> adding the page fault hooks, testing and validating everything.  I'm sending it
> because Amir is traveling a bunch, and I touched it last so I'm going to take
> all the hate and he can take all the credit.
> 
> There is a PoC that I've been using to validate this work, you can find the git
> repo here
> 
> https://github.com/josefbacik/remote-fetch

The test tool seems to be a bit outdated wrt the current series. It took me
quite a while to debug why HSM isn't working with it (eventually I've
tracked it down to the changes in struct fanotify_event_info_range...).
Anyway all seems to be working (after fixing up some missing export), I've
pushed out the result I have to:

https://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git fsnotify

and will push it to linux-next as well so that it gets some soaking before
the merge window. That being said, I'd still like to get an explicit ack from
the XFS folks (hint), so the patches may still get rebased due to that.

								Honza

 
Josef Bacik Sept. 5, 2024, 7:29 p.m. UTC | #4
On Thu, Sep 05, 2024 at 02:08:08PM +0200, Jan Kara wrote:
> Hello!
> 
> On Wed 04-09-24 16:27:50, Josef Bacik wrote:
> > These are the patches for the bare bones pre-content fanotify support.  The
> > majority of this work is Amir's, my contribution to this has solely been around
> > adding the page fault hooks, testing and validating everything.  I'm sending it
> > because Amir is traveling a bunch, and I touched it last so I'm going to take
> > all the hate and he can take all the credit.
> > 
> > There is a PoC that I've been using to validate this work, you can find the git
> > repo here
> > 
> > https://github.com/josefbacik/remote-fetch
> 
> The test tool seems to be a bit outdated wrt the current series. It took me
> quite a while to debug why HSM isn't working with it (eventually I've
> tracked it down to the changes in struct fanotify_event_info_range...).
> Anyway all seems to be working (after fixing up some missing export), I've
> pushed out the result I have to:

Eesh sorry, I updated it for the fstests and used that as the source of truth
for this stuff, which is how I validated all of the fs'es that got the
FS_ALLOW_HSM flag.

> 
> https://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git fsnotify
> 
> and will push it to linux-next as well so that it gets some soaking before
> the merge window. That being said, I'd still like to get an explicit ack from
> the XFS folks (hint), so the patches may still get rebased due to that.
> 

Awesome, thanks!

Josef
Josef Bacik Sept. 5, 2024, 7:30 p.m. UTC | #5
On Thu, Sep 05, 2024 at 10:33:07AM +0200, Amir Goldstein wrote:
> On Wed, Sep 4, 2024 at 10:29 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > v4: https://lore.kernel.org/linux-fsdevel/cover.1723670362.git.josef@toxicpanda.com/
> > v3: https://lore.kernel.org/linux-fsdevel/cover.1723228772.git.josef@toxicpanda.com/
> > v2: https://lore.kernel.org/linux-fsdevel/cover.1723144881.git.josef@toxicpanda.com/
> > v1: https://lore.kernel.org/linux-fsdevel/cover.1721931241.git.josef@toxicpanda.com/
> >
> > v4->v5:
> > - Cleaned up the various "I'll fix it on commit" notes that Jan made since I had
> >   to respin the series anyway.
> > - Renamed the filemap pagefault helper for fsnotify per Christian's suggestion.
> > - Added a FS_ALLOW_HSM flag per Jan's comments, based on Amir's rough sketch.
> > - Added a patch to disable btrfs defrag on pre-content watched files.
> > - Added a patch to turn on FS_ALLOW_HSM for all the file systems that I tested.
> 
> My only nits are about different ordering of the FS_ALLOW_HSM patches
> I guess as the merge window is closing in, Jan could do these trivial
> reorders on commit, based on his preference (?).
> 
> > - Added two fstests (which will be posted separately) to validate everything,
> >   re-validated the series with btrfs, xfs, ext4, and bcachefs to make sure I
> >   didn't break anything.
> 
> Very cool!
> 
> Thanks again for the "productization" of my patches :)

Thanks for doing all the heavy lifting in the first place! Glad we can move on
to other things from here,

Josef