Message ID | cover.1723670362.git.josef@toxicpanda.com (mailing list archive) |
---|---|
Headers | show |
Series | fanotify: add pre-content hooks | expand |
On Wed, Aug 14, 2024 at 05:25:18PM -0400, Josef Bacik wrote: > v3: https://lore.kernel.org/linux-fsdevel/cover.1723228772.git.josef@toxicpanda.com/ > v2: https://lore.kernel.org/linux-fsdevel/cover.1723144881.git.josef@toxicpanda.com/ > v1: https://lore.kernel.org/linux-fsdevel/cover.1721931241.git.josef@toxicpanda.com/ > > v3->v4: > - Trying to send a final verson Friday at 5pm before you go on vacation is a > recipe for silly mistakes, fixed the xfs handling yet again, per Christoph's > review. > - Reworked the file system helper so it's handling of fpin was a little less > silly, per Chinner's suggestion. > - Updated the return values to not or in VM_FAULT_RETRY, as we have a comment > in filemap_fault that says if VM_FAULT_ERROR is set we won't have > VM_FAULT_RETRY set. > > v2->v3: > - Fix the pagefault path to do MAY_ACCESS instead, updated the perm handler to > emit PRE_ACCESS in this case, so we can avoid the extraneous perm event as per > Amir's suggestion. > - Reworked the exported helper so the per-filesystem changes are much smaller, > per Amir's suggestion. > - Fixed the screwup for DAX writes per Chinner's suggestion. > - Added Christian's reviewed-by's where appropriate. > > v1->v2: > - reworked the page fault logic based on Jan's suggestion and turned it into a > helper. > - Added 3 patches per-fs where we need to call the fsnotify helper from their > ->fault handlers. > - Disabled readahead in the case that there's a pre-content watch in place. > - Disabled huge faults when there's a pre-content watch in place (entirely > because it's untested, theoretically it should be straightforward to do). > - Updated the command numbers. > - Addressed the random spelling/grammer mistakes that Jan pointed out. > - Addressed the other random nits from Jan. > > --- Original email --- > > Hello, > > These are the patches for the bare bones pre-content fanotify support. The > majority of this work is Amir's, my contribution to this has solely been around > adding the page fault hooks, testing and validating everything. I'm sending it > because Amir is traveling a bunch, and I touched it last so I'm going to take > all the hate and he can take all the credit. > > There is a PoC that I've been using to validate this work, you can find the git > repo here > > https://github.com/josefbacik/remote-fetch > > This consists of 3 different tools. > > 1. populate. This just creates all the stub files in the directory from the > source directory. Just run ./populate ~/linux ~/hsm-linux and it'll > recursively create all of the stub files and directories. > 2. remote-fetch. This is the actual PoC, you just point it at the source and > destination directory and then you can do whatever. ./remote-fetch ~/linux > ~/hsm-linux. > 3. mmap-validate. This was to validate the pagefault thing, this is likely what > will be turned into the selftest with remote-fetch. It creates a file and > then you can validate the file matches the right pattern with both normal > reads and mmap. Normally I do something like > > ./mmap-validate create ~/src/foo > ./populate ~/src ~/dst > ./rmeote-fetch ~/src ~/dst > ./mmap-validate validate ~/dst/foo > > I did a bunch of testing, I also got some performance numbers. I copied a > kernel tree, and then did remote-fetch, and then make -j4 > > Normal > real 9m49.709s > user 28m11.372s > sys 4m57.304s > > HSM > real 10m6.454s > user 29m10.517s > sys 5m2.617s > > So ~17 seconds more to build with HSM. I then did a make mrproper on both trees > to see the size > > [root@fedora ~]# du -hs /src/linux > 1.6G /src/linux > [root@fedora ~]# du -hs dst > 125M dst > > This mirrors the sort of savings we've seen in production. > > Meta has had these patches (minus the page fault patch) deployed in production > for almost a year with our own utility for doing on-demand package fetching. > The savings from this has been pretty significant. > > The page-fault hooks are necessary for the last thing we need, which is > on-demand range fetching of executables. Some of our binaries are several gigs > large, having the ability to remote fetch them on demand is a huge win for us > not only with space savings, but with startup time of containers. So... does this pre-content fetcher already work for regular reads and writes, and now you're looking to wire up page faults? Or does it only handle page faults? Judging from Amir's patches I'm guessing the FAN_PRE_{ACCESS,MODIFY} events are new, so this only sends notifications prior to read and write page faults? The XFS change looks reasonable to me, but I'm left wondering "what does this shiny /do/?" --D > There will be tests for this going into LTP once we're satisfied with the > patches and they're on their way upstream. Thanks, > > Josef > > Amir Goldstein (8): > fsnotify: introduce pre-content permission event > fsnotify: generate pre-content permission event on open > fanotify: introduce FAN_PRE_ACCESS permission event > fanotify: introduce FAN_PRE_MODIFY permission event > fanotify: pass optional file access range in pre-content event > fanotify: rename a misnamed constant > fanotify: report file range info with pre-content events > fanotify: allow to set errno in FAN_DENY permission response > > Josef Bacik (8): > fanotify: don't skip extra event info if no info_mode is set > fanotify: add a helper to check for pre content events > fanotify: disable readahead if we have pre-content watches > mm: don't allow huge faults for files with pre content watches > fsnotify: generate pre-content permission event on page fault > bcachefs: add pre-content fsnotify hook to fault > gfs2: add pre-content fsnotify hook to fault > xfs: add pre-content fsnotify hook for write faults > > fs/bcachefs/fs-io-pagecache.c | 4 + > fs/gfs2/file.c | 4 + > fs/namei.c | 9 ++ > fs/notify/fanotify/fanotify.c | 32 ++++++-- > fs/notify/fanotify/fanotify.h | 20 +++++ > fs/notify/fanotify/fanotify_user.c | 116 +++++++++++++++++++++----- > fs/notify/fsnotify.c | 14 +++- > fs/xfs/xfs_file.c | 4 + > include/linux/fanotify.h | 20 +++-- > include/linux/fsnotify.h | 54 ++++++++++-- > include/linux/fsnotify_backend.h | 59 ++++++++++++- > include/linux/mm.h | 1 + > include/uapi/linux/fanotify.h | 17 ++++ > mm/filemap.c | 128 +++++++++++++++++++++++++++-- > mm/memory.c | 22 +++++ > mm/readahead.c | 13 +++ > security/selinux/hooks.c | 3 +- > 17 files changed, 469 insertions(+), 51 deletions(-) > > -- > 2.43.0 > >
On Thu, Aug 29, 2024 at 11:41 PM Darrick J. Wong <djwong@kernel.org> wrote: > > On Wed, Aug 14, 2024 at 05:25:18PM -0400, Josef Bacik wrote: > > v3: https://lore.kernel.org/linux-fsdevel/cover.1723228772.git.josef@toxicpanda.com/ > > v2: https://lore.kernel.org/linux-fsdevel/cover.1723144881.git.josef@toxicpanda.com/ > > v1: https://lore.kernel.org/linux-fsdevel/cover.1721931241.git.josef@toxicpanda.com/ > > > > v3->v4: > > - Trying to send a final verson Friday at 5pm before you go on vacation is a > > recipe for silly mistakes, fixed the xfs handling yet again, per Christoph's > > review. > > - Reworked the file system helper so it's handling of fpin was a little less > > silly, per Chinner's suggestion. > > - Updated the return values to not or in VM_FAULT_RETRY, as we have a comment > > in filemap_fault that says if VM_FAULT_ERROR is set we won't have > > VM_FAULT_RETRY set. > > > > v2->v3: > > - Fix the pagefault path to do MAY_ACCESS instead, updated the perm handler to > > emit PRE_ACCESS in this case, so we can avoid the extraneous perm event as per > > Amir's suggestion. > > - Reworked the exported helper so the per-filesystem changes are much smaller, > > per Amir's suggestion. > > - Fixed the screwup for DAX writes per Chinner's suggestion. > > - Added Christian's reviewed-by's where appropriate. > > > > v1->v2: > > - reworked the page fault logic based on Jan's suggestion and turned it into a > > helper. > > - Added 3 patches per-fs where we need to call the fsnotify helper from their > > ->fault handlers. > > - Disabled readahead in the case that there's a pre-content watch in place. > > - Disabled huge faults when there's a pre-content watch in place (entirely > > because it's untested, theoretically it should be straightforward to do). > > - Updated the command numbers. > > - Addressed the random spelling/grammer mistakes that Jan pointed out. > > - Addressed the other random nits from Jan. > > > > --- Original email --- > > > > Hello, > > > > These are the patches for the bare bones pre-content fanotify support. The > > majority of this work is Amir's, my contribution to this has solely been around > > adding the page fault hooks, testing and validating everything. I'm sending it > > because Amir is traveling a bunch, and I touched it last so I'm going to take > > all the hate and he can take all the credit. > > > > There is a PoC that I've been using to validate this work, you can find the git > > repo here > > > > https://github.com/josefbacik/remote-fetch > > > > This consists of 3 different tools. > > > > 1. populate. This just creates all the stub files in the directory from the > > source directory. Just run ./populate ~/linux ~/hsm-linux and it'll > > recursively create all of the stub files and directories. > > 2. remote-fetch. This is the actual PoC, you just point it at the source and > > destination directory and then you can do whatever. ./remote-fetch ~/linux > > ~/hsm-linux. > > 3. mmap-validate. This was to validate the pagefault thing, this is likely what > > will be turned into the selftest with remote-fetch. It creates a file and > > then you can validate the file matches the right pattern with both normal > > reads and mmap. Normally I do something like > > > > ./mmap-validate create ~/src/foo > > ./populate ~/src ~/dst > > ./rmeote-fetch ~/src ~/dst > > ./mmap-validate validate ~/dst/foo > > > > I did a bunch of testing, I also got some performance numbers. I copied a > > kernel tree, and then did remote-fetch, and then make -j4 > > > > Normal > > real 9m49.709s > > user 28m11.372s > > sys 4m57.304s > > > > HSM > > real 10m6.454s > > user 29m10.517s > > sys 5m2.617s > > > > So ~17 seconds more to build with HSM. I then did a make mrproper on both trees > > to see the size > > > > [root@fedora ~]# du -hs /src/linux > > 1.6G /src/linux > > [root@fedora ~]# du -hs dst > > 125M dst > > > > This mirrors the sort of savings we've seen in production. > > > > Meta has had these patches (minus the page fault patch) deployed in production > > for almost a year with our own utility for doing on-demand package fetching. > > The savings from this has been pretty significant. > > > > The page-fault hooks are necessary for the last thing we need, which is > > on-demand range fetching of executables. Some of our binaries are several gigs > > large, having the ability to remote fetch them on demand is a huge win for us > > not only with space savings, but with startup time of containers. > > So... does this pre-content fetcher already work for regular reads and > writes, and now you're looking to wire up page faults? Or does it only > handle page faults? Judging from Amir's patches I'm guessing the > FAN_PRE_{ACCESS,MODIFY} events are new, so this only sends notifications > prior to read and write page faults? The XFS change looks reasonable to > me, but I'm left wondering "what does this shiny /do/?" > I *think* I understand the confusion. Let me try to sort it out. This patch set collaboration aims to add the functionality of HSM service by adding events FS_PRE_{ACCESS,MODIFY} prior to read/write and page faults. Maybe you are puzzled by not seeing any new read/write hooks? This is because the HSM events utilize the existing fsnotify_file_*perm() hooks that are in place for the legacy FS_ACCESS_PERM event. So why is a new FS_PRE_ACCESS needed? Let me first quote commit message of patch 2/16 [1]: --- The new FS_PRE_ACCESS permission event is similar to FS_ACCESS_PERM, but it meant for a different use case of filling file content before access to a file range, so it has slightly different semantics. Generate FS_PRE_ACCESS/FS_ACCESS_PERM as two seperate events, same as we did for FS_OPEN_PERM/FS_OPEN_EXEC_PERM. FS_PRE_MODIFY is a new permission event, with similar semantics as FS_PRE_ACCESS, which is called before a file is modified. FS_ACCESS_PERM is reported also on blockdev and pipes, but the new pre-content events are only reported for regular files and dirs. The pre-content events are meant to be used by hierarchical storage managers that want to fill the content of files on first access. --- And from my man page draft [2]: --- FAN_PRE_ACCESS (since Linux 6.x) Create an event before a read from a directory or a file range, that provides an opportunity for the event listener to modify the content of the object before the reader is going to access the content in the specified range. An additional information record of type FAN_EVENT_INFO_TYPE_RANGE is returned for each event in the read buffer. ... --- So the semantics of the two events is slightly different, but also the meaning of "an opportunity for the event listener to modify the content". FS_ACCESS_PERM already provided this opportunity on old kernels, but prior to "Tidy up file permission hooks" series [3] in v6.8, writing file content in FS_ACCESS_PERM event context was prone to deadlocks. Therefore, an application using FS_ACCESS_PERM may be prone for deadlocks, while an application using FAN_PRE_ACCESS should be safe in that regard. Thanks, Amir. [1] https://lore.kernel.org/all/a96217d84dfebb15582a04524dc9821ba3ea1406.1723670362.git.josef@toxicpanda.com/ [2] https://github.com/amir73il/man-pages/commits/fan_pre_path [3] https://lore.kernel.org/all/20231122122715.2561213-1-amir73il@gmail.com/
On Fri, Aug 30, 2024 at 10:55:10AM +0200, Amir Goldstein wrote: > On Thu, Aug 29, 2024 at 11:41 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > > On Wed, Aug 14, 2024 at 05:25:18PM -0400, Josef Bacik wrote: > > > v3: https://lore.kernel.org/linux-fsdevel/cover.1723228772.git.josef@toxicpanda.com/ > > > v2: https://lore.kernel.org/linux-fsdevel/cover.1723144881.git.josef@toxicpanda.com/ > > > v1: https://lore.kernel.org/linux-fsdevel/cover.1721931241.git.josef@toxicpanda.com/ > > > > > > v3->v4: > > > - Trying to send a final verson Friday at 5pm before you go on vacation is a > > > recipe for silly mistakes, fixed the xfs handling yet again, per Christoph's > > > review. > > > - Reworked the file system helper so it's handling of fpin was a little less > > > silly, per Chinner's suggestion. > > > - Updated the return values to not or in VM_FAULT_RETRY, as we have a comment > > > in filemap_fault that says if VM_FAULT_ERROR is set we won't have > > > VM_FAULT_RETRY set. > > > > > > v2->v3: > > > - Fix the pagefault path to do MAY_ACCESS instead, updated the perm handler to > > > emit PRE_ACCESS in this case, so we can avoid the extraneous perm event as per > > > Amir's suggestion. > > > - Reworked the exported helper so the per-filesystem changes are much smaller, > > > per Amir's suggestion. > > > - Fixed the screwup for DAX writes per Chinner's suggestion. > > > - Added Christian's reviewed-by's where appropriate. > > > > > > v1->v2: > > > - reworked the page fault logic based on Jan's suggestion and turned it into a > > > helper. > > > - Added 3 patches per-fs where we need to call the fsnotify helper from their > > > ->fault handlers. > > > - Disabled readahead in the case that there's a pre-content watch in place. > > > - Disabled huge faults when there's a pre-content watch in place (entirely > > > because it's untested, theoretically it should be straightforward to do). > > > - Updated the command numbers. > > > - Addressed the random spelling/grammer mistakes that Jan pointed out. > > > - Addressed the other random nits from Jan. > > > > > > --- Original email --- > > > > > > Hello, > > > > > > These are the patches for the bare bones pre-content fanotify support. The > > > majority of this work is Amir's, my contribution to this has solely been around > > > adding the page fault hooks, testing and validating everything. I'm sending it > > > because Amir is traveling a bunch, and I touched it last so I'm going to take > > > all the hate and he can take all the credit. > > > > > > There is a PoC that I've been using to validate this work, you can find the git > > > repo here > > > > > > https://github.com/josefbacik/remote-fetch > > > > > > This consists of 3 different tools. > > > > > > 1. populate. This just creates all the stub files in the directory from the > > > source directory. Just run ./populate ~/linux ~/hsm-linux and it'll > > > recursively create all of the stub files and directories. > > > 2. remote-fetch. This is the actual PoC, you just point it at the source and > > > destination directory and then you can do whatever. ./remote-fetch ~/linux > > > ~/hsm-linux. > > > 3. mmap-validate. This was to validate the pagefault thing, this is likely what > > > will be turned into the selftest with remote-fetch. It creates a file and > > > then you can validate the file matches the right pattern with both normal > > > reads and mmap. Normally I do something like > > > > > > ./mmap-validate create ~/src/foo > > > ./populate ~/src ~/dst > > > ./rmeote-fetch ~/src ~/dst > > > ./mmap-validate validate ~/dst/foo > > > > > > I did a bunch of testing, I also got some performance numbers. I copied a > > > kernel tree, and then did remote-fetch, and then make -j4 > > > > > > Normal > > > real 9m49.709s > > > user 28m11.372s > > > sys 4m57.304s > > > > > > HSM > > > real 10m6.454s > > > user 29m10.517s > > > sys 5m2.617s > > > > > > So ~17 seconds more to build with HSM. I then did a make mrproper on both trees > > > to see the size > > > > > > [root@fedora ~]# du -hs /src/linux > > > 1.6G /src/linux > > > [root@fedora ~]# du -hs dst > > > 125M dst > > > > > > This mirrors the sort of savings we've seen in production. > > > > > > Meta has had these patches (minus the page fault patch) deployed in production > > > for almost a year with our own utility for doing on-demand package fetching. > > > The savings from this has been pretty significant. > > > > > > The page-fault hooks are necessary for the last thing we need, which is > > > on-demand range fetching of executables. Some of our binaries are several gigs > > > large, having the ability to remote fetch them on demand is a huge win for us > > > not only with space savings, but with startup time of containers. > > > > So... does this pre-content fetcher already work for regular reads and > > writes, and now you're looking to wire up page faults? Or does it only > > handle page faults? Judging from Amir's patches I'm guessing the > > FAN_PRE_{ACCESS,MODIFY} events are new, so this only sends notifications > > prior to read and write page faults? The XFS change looks reasonable to > > me, but I'm left wondering "what does this shiny /do/?" > > > > I *think* I understand the confusion. > > Let me try to sort it out. > > This patch set collaboration aims to add the functionality of HSM > service by adding events FS_PRE_{ACCESS,MODIFY} prior to > read/write and page faults. > > Maybe you are puzzled by not seeing any new read/write hooks? > This is because the HSM events utilize the existing fsnotify_file_*perm() > hooks that are in place for the legacy FS_ACCESS_PERM event. > > So why is a new FS_PRE_ACCESS needed? > Let me first quote commit message of patch 2/16 [1]: > --- > The new FS_PRE_ACCESS permission event is similar to FS_ACCESS_PERM, > but it meant for a different use case of filling file content before > access to a file range, so it has slightly different semantics. > > Generate FS_PRE_ACCESS/FS_ACCESS_PERM as two seperate events, same as > we did for FS_OPEN_PERM/FS_OPEN_EXEC_PERM. > > FS_PRE_MODIFY is a new permission event, with similar semantics as > FS_PRE_ACCESS, which is called before a file is modified. > > FS_ACCESS_PERM is reported also on blockdev and pipes, but the new > pre-content events are only reported for regular files and dirs. > > The pre-content events are meant to be used by hierarchical storage > managers that want to fill the content of files on first access. > --- > > And from my man page draft [2]: > --- > FAN_PRE_ACCESS (since Linux 6.x) > Create an event before a read from a directory or a file range, > that provides an opportunity for the event listener to > modify the content of the object > before the reader is going to access the content in the > specified range. > An additional information record of type > FAN_EVENT_INFO_TYPE_RANGE is returned > for each event in the read buffer. > ... > --- > > So the semantics of the two events is slightly different, but also the > meaning of "an opportunity for the event listener to modify the content". > > FS_ACCESS_PERM already provided this opportunity on old kernels, > but prior to "Tidy up file permission hooks" series [3] in v6.8, writing > file content in FS_ACCESS_PERM event context was prone to deadlocks. > > Therefore, an application using FS_ACCESS_PERM may be prone > for deadlocks, while an application using FAN_PRE_ACCESS should > be safe in that regard. Ah, ok, that's where the missing pieces are. :) --D > Thanks, > Amir. > > [1] https://lore.kernel.org/all/a96217d84dfebb15582a04524dc9821ba3ea1406.1723670362.git.josef@toxicpanda.com/ > [2] https://github.com/amir73il/man-pages/commits/fan_pre_path > [3] https://lore.kernel.org/all/20231122122715.2561213-1-amir73il@gmail.com/ >