Message ID | b509ec78c045d67d4d7e31976eba4b708b238b66.1731355931.git.josef@toxicpanda.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | fanotify: add pre-content hooks | expand |
On Mon, 11 Nov 2024 at 12:19, Josef Bacik <josef@toxicpanda.com> wrote: > > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -3782,7 +3782,15 @@ static int do_open(struct nameidata *nd, > + /* > + * This permission hook is different than fsnotify_open_perm() hook. > + * This is a pre-content hook that is called without sb_writers held > + * and after the file was truncated. > + */ > + return fsnotify_file_area_perm(file, MAY_OPEN, &file->f_pos, 0); > } Stop adding sh*t like this to the VFS layer. Seriously. I spend time and effort looking at profiles, and then people who do *not* seem to spend the time and effort just willy nilly add fsnotify and security events and show down basic paths. I'm going to NAK any new fsnotify and permission hooks unless people show that they don't add any overhead. Because I'm really really tired of having to wade through various permission hooks in the profiles that I can not fix or optimize, because those hoosk have no sane defined semantics, just "let user space know". Yes, right now it's mostly just the security layer. But this really looks to me like now fsnotify will be the same kind of pain. And that location is STUPID. Dammit, it is even *documented* to be stupid. It's a "pre-content" hook that happens after the contents have already been truncated. WTF? That's no "pre". I tried to follow the deep chain of inlines to see what actually ends up happening, and it looks like if the *whole* filesystem has no notify events at all, the fsnotify_sb_has_watchers() check will make this mostly go away, except for all the D$ accesses needed just to check for it. But even *one* entirely unrelated event will now force every single open to typically call __fsnotify_parent() (or possibly "just" fsnotify), because there's no sane "nobody cares about this dentry" kind of thing. So effectively this is a new hook that gets called on every single open call that nobody else cares about than you, and that people have lived without for three decades. Stop it, or at least add the code to not do this all completely pointlessly. Because otherwise I will not take this kind of stuff any more. I just spent time trying to figure out how to avoid the pointless cache misses we did for every path component traversal. So I *really* don't want to see another pointless stupid fsnotify hook in my profiles. Linus
On Mon, Nov 11, 2024 at 4:52 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Mon, 11 Nov 2024 at 12:19, Josef Bacik <josef@toxicpanda.com> wrote: > > > > --- a/fs/namei.c > > +++ b/fs/namei.c > > @@ -3782,7 +3782,15 @@ static int do_open(struct nameidata *nd, > > + /* > > + * This permission hook is different than fsnotify_open_perm() hook. > > + * This is a pre-content hook that is called without sb_writers held > > + * and after the file was truncated. > > + */ > > + return fsnotify_file_area_perm(file, MAY_OPEN, &file->f_pos, 0); > > } > > Stop adding sh*t like this to the VFS layer. > > Seriously. I spend time and effort looking at profiles, and then > people who do *not* seem to spend the time and effort just willy nilly > add fsnotify and security events and show down basic paths. > > I'm going to NAK any new fsnotify and permission hooks unless people > show that they don't add any overhead. > > Because I'm really really tired of having to wade through various > permission hooks in the profiles that I can not fix or optimize, > because those hoosk have no sane defined semantics, just "let user > space know". > > Yes, right now it's mostly just the security layer. But this really > looks to me like now fsnotify will be the same kind of pain. > > And that location is STUPID. Dammit, it is even *documented* to be > stupid. It's a "pre-content" hook that happens after the contents have > already been truncated. WTF? That's no "pre". > > I tried to follow the deep chain of inlines to see what actually ends > up happening, and it looks like if the *whole* filesystem has no > notify events at all, the fsnotify_sb_has_watchers() check will make > this mostly go away, except for all the D$ accesses needed just to > check for it. > > But even *one* entirely unrelated event will now force every single > open to typically call __fsnotify_parent() (or possibly "just" > fsnotify), because there's no sane "nobody cares about this dentry" > kind of thing. > > So effectively this is a new hook that gets called on every single > open call that nobody else cares about than you, and that people have > lived without for three decades. > > Stop it, or at least add the code to not do this all completely pointlessly. > > Because otherwise I will not take this kind of stuff any more. I just > spent time trying to figure out how to avoid the pointless cache > misses we did for every path component traversal. > > So I *really* don't want to see another pointless stupid fsnotify hook > in my profiles. Did you see the patch that added the fsnotify_file_has_pre_content_watches() thing? It'll look at the inode/sb/mnt to see if there are any watches and carry on if there's nothing. This will be the cheapest thing open/write/whatever does. If there's no watches, nothing happens. I'm not entirely sure what else you want me to do here, other than not add the feature, which I suppose is a position to take. The overhead here is purely for people who use HSM. I'm telling you that in production, with real world workloads, the overhead is far outweighed by having to copy bytes around we'll never use. Your argument is similar to the argument against adding compression to btrfs, "but it costs CPU!?", sure, but the benefit of writing significantly less waaaaay outweighed the CPU cost and was a huge net positive. This is going to cost something if it's on, it costs nothing if it's off. If you want performance numbers that's reasonable, I'll run fsperf tomorrow and get you a side by side comparison. Thanks, Josef
On Mon, 11 Nov 2024 at 14:46, Josef Bacik <josef@toxicpanda.com> wrote: > > Did you see the patch that added the > fsnotify_file_has_pre_content_watches() thing? No, because I had gotten to patch 6/11, and it added this open thing, and there was no such thing in any of the patches before it. It looks like you added FSNOTIFY_PRE_CONTENT_EVENTS in 11/17. However, at no point does it look like you actually test it at open time, so none of this seems to matter. As far as I can see, even at the end of the series, you will call the fsnotify hook at open time even if there are no content watches on the file. So apparently the fsnotify_file_has_pre_content_watches() is not called when it should be, and when it *is* called, it's also doing completely the wrong thing. Look, for basic operations THAT DON'T CARE, you now added a function call to fsnotify_file_has_pre_content_watches(), that function call looks at inode->i_sb->s_iflags (doing two D$ accesses that shouldn't be done!), and then after that looks at the i_fsnotify_mask. THIS IS EXACTLY THE KIND OF GARBAGE I'M TALKING ABOUT. This code has been written by somebody who NEVER EVER looked at profiles. You're following chains of pointers when you never should. Look, here's a very basic example of the kind of complete mis-design I'm talking about: - we're doing a basic read() on a file that isn't being watched. - we want to maybe do read-ahead - the code does if (fsnotify_file_has_pre_content_watches(file)) return fpin; to say that "don't do read-ahead". Fine, I understand the concept. But keep in mind that the common case is presumably that there are no content watches. And even ignoring the "common case" issue, that's the one you want to OPTIMIZE for. That's the case that matters for performance, because clearly if there are content watches, you're going to go into "Go Slow" mode anyway and not do pre-fetching. So even if content watches are common on some load, they are clearly not the case you should do performance optimization for. With me so far? So if THAT is the case that matters, then dammit, we shouldn't be calling a function at all. And when calling the function, we shouldn't start out with this completely broken logic: struct inode *inode = file_inode(file); __u32 mnt_mask = real_mount(file->f_path.mnt)->mnt_fsnotify_mask; if (!(inode->i_sb->s_iflags & SB_I_ALLOW_HSM)) return false; that does random crap and looks up some "mount mask" and looks up the superblock flags. Why shouldn't we do this? BECAUSE NONE OF THIS MATTERS IF THE FILE HASN'T EVEN BEEN MARKED FOR CONTENT MATCHES! See why I'm shouting? You're doing insane things, and you're doing them for all the cases that DO NOT MATTER. You're doing all of this for the common case that doesn't want to see that kind of mindless overhead. You literally check for the "do I even care" *last*, when you finally do that fsnotify_object_watched() check that looks at the inode. But by then you have already wasted all that time and effort, and fsnotify_object_watched() is broken anyway, because it's stupidly designed to require that mnt_mask that isn't needed if you have properly marked each object individually. So what *should* you have? You should have had a per-file flag saying "Do I need to even call this crud at all", and have it in a location where you don't need to look at anything else. And fsnotify already basically has that flag, except it's mis-designed too. We have FMODE_NONOTIFY, which is the wrong way around (saying "don't notify", when that should just be the *default*), and the fsnotify layer uses it only to mark its own internal files so that it doesn't get called recursively. So that flag that *looks* sane and is in the right location is actually doing the wrong thing, because it's dealing with a rare special case, not the important cases that actually matter. So all of this readahead logic - and all of the read and write hooks - should be behind a simple "oh, this file doesn't have any notification stuff, so don't bother calling any fsnotify functions". So I think the pattern should be static inline bool fsnotify_file_has_pre_content_watches(struct file *file) { if (unlikely(file->f_mode & FMODE_NOTIFY)) return out_of_line_crud(file); return false; } so that the *only* thing that gets inlined is "does this file have any fsnotify stuff at all", and in the common case where that isn't true, we don't call any fsnotify functions, and we don't start looking at inodes or superblocks or anything pointless like that. THAT is the kind of code sequence we should have for unlikely cases. Not the "call fsnotify code to check for something unlikely". Not "look at file_inode(file)->i_sb->s_iflags" until it's *required*. Similarly, the actual open-time code SHOULD NEVER BE CALLED, because it should have the same kind of simple logic based on some simple flag either in the dentry or maybe in the inode. So that all those *normal* dentries and inodes that don't have watches don't get the overhead of calling into fsnotify code. Because yes, I do see all those function calls, and all those unnecessary indirect pointer accesses when I do profiles. And if I see fsnotify in my profiles when I don't have any notifications enabled, that really really *annoys* me. Linus
On Mon, Nov 11, 2024 at 10:52 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Mon, 11 Nov 2024 at 12:19, Josef Bacik <josef@toxicpanda.com> wrote: > > > > --- a/fs/namei.c > > +++ b/fs/namei.c > > @@ -3782,7 +3782,15 @@ static int do_open(struct nameidata *nd, > > + /* > > + * This permission hook is different than fsnotify_open_perm() hook. > > + * This is a pre-content hook that is called without sb_writers held > > + * and after the file was truncated. > > + */ > > + return fsnotify_file_area_perm(file, MAY_OPEN, &file->f_pos, 0); > > } > > Stop adding sh*t like this to the VFS layer. > > Seriously. I spend time and effort looking at profiles, and then > people who do *not* seem to spend the time and effort just willy nilly > add fsnotify and security events and show down basic paths. > > I'm going to NAK any new fsnotify and permission hooks unless people > show that they don't add any overhead. > FWIW we did work to eliminate overhead long before posting the new hooks. This prep work has already been merged back in v6.9: https://lore.kernel.org/linux-fsdevel/20240317184154.1200192-1-amir73il@gmail.com/ After working closely with Oliver and kernel test robot to eliminate the overhead of the pre-content hooks and on the way, also reduce overhead of the existing fanotify permission hooks that were merged back in the day. > Because I'm really really tired of having to wade through various > permission hooks in the profiles that I can not fix or optimize, > because those hoosk have no sane defined semantics, just "let user > space know". > > Yes, right now it's mostly just the security layer. But this really > looks to me like now fsnotify will be the same kind of pain. > > And that location is STUPID. Dammit, it is even *documented* to be > stupid. It's a "pre-content" hook that happens after the contents have > already been truncated. WTF? That's no "pre". > Yeh, I understand why it seems stupid and it may have been poorly documented, but there is actually a good reason for the location of this hook. The problem with the existing fsnotify_open_perm() hook is that with O_CREATE, open_last_lookups() takes sb_writers freeze protection (regardless if file with really created), so we cannot safely use this hook to start writing to file and fill its content. So the important part of the comment is: "This permission hook is different than fsnotify_open_perm() hook. This is a pre-content hook that is called without sb_writers held" The part that follows "and after the file was truncated", simply means that in case of O_TRUNC we won't need to fill file content, because the file will have zero size anyway. And for the record, it is not "pre-open" it is "pre-content-access", so the name may be confusing but it is not wrong. > I tried to follow the deep chain of inlines to see what actually ends > up happening, and it looks like if the *whole* filesystem has no > notify events at all, the fsnotify_sb_has_watchers() check will make > this mostly go away, except for all the D$ accesses needed just to > check for it. > > But even *one* entirely unrelated event will now force every single > open to typically call __fsnotify_parent() (or possibly "just" > fsnotify), because there's no sane "nobody cares about this dentry" > kind of thing. > > So effectively this is a new hook that gets called on every single > open call that nobody else cares about than you, and that people have > lived without for three decades. > That's exactly the reason for the commit a5e57b4d370c fsnotify: optimize the case of no permission event watchers It is supposed to reduce overhead to bare minimum for hooks of events from the class "permission" (FSNOTIFY_PRIO_CONTENT), which are typically only watched for Anti-malware software, so fsnotify_sb_has_priority_watchers(sb, FSNOTIFY_PRIO_CONTENT) is typically false on any given fs. And same for the new hooks for events of class "pre-content". fsnotify_sb_has_priority_watchers(sb, FSNOTIFY_PRIO_PRE_CONTENT) will only be true for fs of dedicated systems that run an HSM service, where the overhead of the hooks is not going to be a concern. > Stop it, or at least add the code to not do this all completely pointlessly. > > Because otherwise I will not take this kind of stuff any more. I just > spent time trying to figure out how to avoid the pointless cache > misses we did for every path component traversal. > > So I *really* don't want to see another pointless stupid fsnotify hook > in my profiles. I understand your frustration from crawling performance regressions, I really do. I am personally very grateful for the services of Oliver and his kernel test robot who test my development branches to catch regressions very soon in the development process. We may have missed things along the way and you may yet find more issues that justify another NAK, or more work, but you should know that a lot of care was taken to try to avoid inflicting pain on the system. I have been practically doing vfs prep and cleanup work together with Josef and Jan and Christian for at least a year before we got to post v1 of the pre-content patches. So all I ask in return is that you consider the possibility that this is not utter garbage and try to work with us to address your concerns. Pre-content hooks (and HSM) is a very powerful feature that many people are requesting and I believe that we will be able to deliver this feature without crapping all over the VFS. Thanks, Amir.
On Mon, 11 Nov 2024 at 15:22, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > See why I'm shouting? You're doing insane things, and you're doing > them for all the cases that DO NOT MATTER. You're doing all of this > for the common case that doesn't want to see that kind of mindless > overhead. Side note: I think as filesystem people, you guys are taught to think "IO is expensive, as long as you can avoid IO, things go fast". And that's largely true at a filesystem level. But on the VFS level, the common case is actually "everything is cached in memory, we're never calling down to the filesystem at all". And then IO isn't the issue. So on a VFS level, to a very close approximation, the only thing that matters is cache misses and mispredicted branches. (Indirect calls have always had some overhead, and Spectre made it much worse, so arguably indirect calls have become the third thing that matters). So in the VFS layer, we have ridiculous tricks like if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) { if (likely(inode->i_op->permission)) return inode->i_op->permission(idmap, inode, mask); /* This gets set once for the inode lifetime */ spin_lock(&inode->i_lock); inode->i_opflags |= IOP_FASTPERM; spin_unlock(&inode->i_lock); } return generic_permission(idmap, inode, mask); in do_inode_permission, because it turns out that the IOP_FASTPERM flag means that we literally don't even need to dereference inode->i_op->permission (nasty chain of D$ accesses), and we can *only* look at accesses off the 'inode' pointer. Is this an extreme example? Yes. But the whole i_opflags kind of thing does end up mattering, exactly because it keeps the D$ footprint smaller. Linus
On Tue, Nov 12, 2024 at 12:22 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Mon, 11 Nov 2024 at 14:46, Josef Bacik <josef@toxicpanda.com> wrote: > > > > Did you see the patch that added the > > fsnotify_file_has_pre_content_watches() thing? > > No, because I had gotten to patch 6/11, and it added this open thing, > and there was no such thing in any of the patches before it. > > It looks like you added FSNOTIFY_PRE_CONTENT_EVENTS in 11/17. > > However, at no point does it look like you actually test it at open > time, so none of this seems to matter. > > As far as I can see, even at the end of the series, you will call the > fsnotify hook at open time even if there are no content watches on the > file. > > So apparently the fsnotify_file_has_pre_content_watches() is not > called when it should be, and when it *is* called, it's also doing > completely the wrong thing. > > Look, for basic operations THAT DON'T CARE, you now added a function > call to fsnotify_file_has_pre_content_watches(), that function call > looks at inode->i_sb->s_iflags (doing two D$ accesses that shouldn't > be done!), and then after that looks at the i_fsnotify_mask. > > THIS IS EXACTLY THE KIND OF GARBAGE I'M TALKING ABOUT. > > This code has been written by somebody who NEVER EVER looked at > profiles. You're following chains of pointers when you never should. > > Look, here's a very basic example of the kind of complete mis-design > I'm talking about: > > - we're doing a basic read() on a file that isn't being watched. > > - we want to maybe do read-ahead > > - the code does > > if (fsnotify_file_has_pre_content_watches(file)) > return fpin; > > to say that "don't do read-ahead". > > Fine, I understand the concept. But keep in mind that the common case > is presumably that there are no content watches. > > And even ignoring the "common case" issue, that's the one you want to > OPTIMIZE for. That's the case that matters for performance, because > clearly if there are content watches, you're going to go into "Go > Slow" mode anyway and not do pre-fetching. So even if content watches > are common on some load, they are clearly not the case you should do > performance optimization for. > > With me so far? > > So if THAT is the case that matters, then dammit, we shouldn't be > calling a function at all. > > And when calling the function, we shouldn't start out with this > completely broken logic: > > struct inode *inode = file_inode(file); > __u32 mnt_mask = real_mount(file->f_path.mnt)->mnt_fsnotify_mask; > > if (!(inode->i_sb->s_iflags & SB_I_ALLOW_HSM)) > return false; > > that does random crap and looks up some "mount mask" and looks up the > superblock flags. > > Why shouldn't we do this? > > BECAUSE NONE OF THIS MATTERS IF THE FILE HASN'T EVEN BEEN MARKED FOR > CONTENT MATCHES! > > See why I'm shouting? You're doing insane things, and you're doing > them for all the cases that DO NOT MATTER. You're doing all of this > for the common case that doesn't want to see that kind of mindless > overhead. > > You literally check for the "do I even care" *last*, when you finally > do that fsnotify_object_watched() check that looks at the inode. But > by then you have already wasted all that time and effort, and > fsnotify_object_watched() is broken anyway, because it's stupidly > designed to require that mnt_mask that isn't needed if you have > properly marked each object individually. > > So what *should* you have? > > You should have had a per-file flag saying "Do I need to even call > this crud at all", and have it in a location where you don't need to > look at anything else. > > And fsnotify already basically has that flag, except it's mis-designed > too. We have FMODE_NONOTIFY, which is the wrong way around (saying > "don't notify", when that should just be the *default*), and the > fsnotify layer uses it only to mark its own internal files so that it > doesn't get called recursively. So that flag that *looks* sane and is > in the right location is actually doing the wrong thing, because it's > dealing with a rare special case, not the important cases that > actually matter. > > So all of this readahead logic - and all of the read and write hooks - > should be behind a simple "oh, this file doesn't have any notification > stuff, so don't bother calling any fsnotify functions". > > So I think the pattern should be > > static inline bool fsnotify_file_has_pre_content_watches(struct file *file) > { > if (unlikely(file->f_mode & FMODE_NOTIFY)) > return out_of_line_crud(file); > return false; > } > I think that's a good idea for pre-content events, because it's fine to say that if the sb/mount was not watched by a pre-content event listener at the time of file open, then we do not care. The problem is that legacy inotify/fanotify watches can be added after file is open, so that is allegedly why this optimization was not done for fsnotify hooks in the past. Thanks, Amir.
On Mon, 11 Nov 2024 at 16:00, Amir Goldstein <amir73il@gmail.com> wrote: > > I think that's a good idea for pre-content events, because it's fine > to say that if the sb/mount was not watched by a pre-content event listener > at the time of file open, then we do not care. Right. > The problem is that legacy inotify/fanotify watches can be added after > file is open, so that is allegedly why this optimization was not done for > fsnotify hooks in the past. So honestly, even if the legacy fsnotify hooks can't look at the file flag, they could damn well look at an inode flag. And I'm not even convinced that we couldn't fix them to just look at a file flag, and say "tough luck, somebody opened that file before you started watching, you don't get to see what they did". So even if we don't look at a file->f_mode flag, the lergacy cases should look at i_fsnotify_mask, and do that *first*. IOW, not do it like fsnotify_object_watched() does now, which is just broken. Again, it looks at inode->i_sb->s_fsnotify_mask completely pointlessly, but it also does it much too late - it gets called after we've already called into the fsnotify() code and have messed up the I$ etc. The "linode->i_sb->s_fsnotify_mask" is not only an extra indirection, it should be very *literally* pointless. If some bit isn't set in i_sb->s_fsnotify_mask, then there should be no way to set that bit in inode->i_fsnotify_mask. So the only time we should access i_sb->s_fsnotify_mask is when i_notify_mask gets *modified*, not when it gets tested. But even if that silly and pointless i_sb->s_fsnotify_mask thing is removed, fsnotify_object_watched() is *still* wrong, because it requires that mnt_mask as an argument, which means that the caller now has to look it up - all this entirely pointless work that should never be done if the bit wasn't set in inode->i_fsnotify_mask. So I really think fsnotify is doing *everything* wrong. And I most certainly don't want to add more runtime hooks to *critical* code like open/read/write. Right now, many of the fsnotify things are for "metadata", ie for bigger file creation / removal / move etc. And yes, the "don't do this if there are no fsnotify watchers AT ALL" does actually end up meaning that most of the time I never see any of it in profiles, because the fsnotify_sb_has_watchers() culls out that case. And while the fsnotify_sb_has_watchers() thing is broken garbage and does too many indirections and is not testing the right thing, at least it's inlined and you don't get the function calls. That doesn't make fsnotify "right", but at least it's not in my face. I see the sb accesses, and I hate them, but it's usually at least hidden. Admittedly not as well hidden as it *should* be, since it does the access tests in the wrong order, but the old fsnotify_open() doesn't strike me as "terminally broken". It doesn't have a permission test after the open has already done things, and it's inlined enough that it isn't actively offensive. And most of the other fsnotify things have the same pattern - not great, but not actively offensive. These new patches make it in my face. So I do require that the *new* cases at least get it right. The fact that we have old code that is misdesigned and gets it wrong and should also be improved isn't an excuse to add *more* badly coded stuff. And yes, if somebody fixes the old fsnotify stuff to check just the i_fsnotify_mask in the inline function, and moves all the other silly checks out-of-line, that would be an improvement. I'd very much applaud that. But it's a separate thing from adding new hooks. Linus
On Tue, Nov 12, 2024 at 1:37 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Mon, 11 Nov 2024 at 16:00, Amir Goldstein <amir73il@gmail.com> wrote: > > > > I think that's a good idea for pre-content events, because it's fine > > to say that if the sb/mount was not watched by a pre-content event listener > > at the time of file open, then we do not care. > > Right. > > > The problem is that legacy inotify/fanotify watches can be added after > > file is open, so that is allegedly why this optimization was not done for > > fsnotify hooks in the past. > > So honestly, even if the legacy fsnotify hooks can't look at the file > flag, they could damn well look at an inode flag. > Legacy fanotify has a mount watch (FAN_MARK_MOUNT), which is the common way for Anti-malware to set watches on filesystems, so I am not sure what you are saying. > And I'm not even convinced that we couldn't fix them to just look at a > file flag, and say "tough luck, somebody opened that file before you > started watching, you don't get to see what they did". That would specifically break tail -f (for inotify) and probably many other tools, but as long as we also look at the inode flags (i_fsnotify_mask) and the dentry flags (DCACHE_FSNOTIFY_PARENT_WATCHED), then I think we may be able to get away with changing the semantics for open files on a fanotify mount watch. Specifically, I would really like to eliminate completely the cost of FAN_ACCESS_PERM event, which could be gated on file flag, because this is only for security/Anti-malware and I don't think this event is practically useful and it sure does not need to guarantee permission events to mount watchers on already open files. > > So even if we don't look at a file->f_mode flag, the lergacy cases > should look at i_fsnotify_mask, and do that *first*. > > IOW, not do it like fsnotify_object_watched() does now, which is just > broken. Again, it looks at inode->i_sb->s_fsnotify_mask completely > pointlessly, but it also does it much too late - it gets called after > we've already called into the fsnotify() code and have messed up the > I$ etc. > > The "linode->i_sb->s_fsnotify_mask" is not only an extra indirection, > it should be very *literally* pointless. If some bit isn't set in > i_sb->s_fsnotify_mask, then there should be no way to set that bit in > inode->i_fsnotify_mask. So the only time we should access > i_sb->s_fsnotify_mask is when i_notify_mask gets *modified*, not when > it gets tested. > i_fsnotify_mask is the cumulative mask of all inode watchers s_fsnotify_mask is *not* the cumulative of all i_fsnotify_mask s_fsnotify_mask is the cumulative mask of all sb watchers mnt_fsnotify_marks is the cumulative mask of all mount watchers > But even if that silly and pointless i_sb->s_fsnotify_mask thing is > removed, fsnotify_object_watched() is *still* wrong, because it > requires that mnt_mask as an argument, which means that the caller now > has to look it up - all this entirely pointless work that should never > be done if the bit wasn't set in inode->i_fsnotify_mask. > > So I really think fsnotify is doing *everything* wrong. Note the difference between fsnotify_sb_has_watchers() and fsnotify_object_watched(). The former is an early optimization gate that checks if there are any inode/mount/sb watchers (per class) on the filesystem, regardless of specific events and specific target inode/file. We could possibly further optimize fsnotify_sb_has_watchers() to avoid access to ->s_fsnotify_info by setting sb flag (for each priority class). The latter checks if any inode/mount/sb are interested in a specific event on the said object. In upstream code, fsnotify_object_watched() is always gated behind fsnotify_sb_has_watchers(), which tries to prevent the indirect call. The new fsnotify_file_has_pre_content_watches() helper could have checked fsnotify_sb_has_priority_watchers(sb, FSNOTIFY_PRIO_CONTENT); but it is better to gate by file flag as you suggested. > > And I most certainly don't want to add more runtime hooks to > *critical* code like open/read/write. > > Right now, many of the fsnotify things are for "metadata", ie for > bigger file creation / removal / move etc. And yes, the "don't do this > if there are no fsnotify watchers AT ALL" does actually end up meaning > that most of the time I never see any of it in profiles, because the > fsnotify_sb_has_watchers() culls out that case. > > And while the fsnotify_sb_has_watchers() thing is broken garbage and > does too many indirections and is not testing the right thing, at > least it's inlined and you don't get the function calls. > > That doesn't make fsnotify "right", but at least it's not in my face. > I see the sb accesses, and I hate them, but it's usually at least > hidden. Admittedly not as well hidden as it *should* be, since it does > the access tests in the wrong order, but the old fsnotify_open() > doesn't strike me as "terminally broken". > > It doesn't have a permission test after the open has already done > things, and it's inlined enough that it isn't actively offensive. > > And most of the other fsnotify things have the same pattern - not > great, but not actively offensive. > > These new patches make it in my face. > > So I do require that the *new* cases at least get it right. The fact > that we have old code that is misdesigned and gets it wrong and should > also be improved isn't an excuse to add *more* badly coded stuff. > > And yes, if somebody fixes the old fsnotify stuff to check just the > i_fsnotify_mask in the inline function, and moves all the other silly > checks out-of-line, that would be an improvement. I'd very much > applaud that. But it's a separate thing from adding new hooks. We will do both. Thanks, Amir.
On Tue 12-11-24 09:11:32, Amir Goldstein wrote: > On Tue, Nov 12, 2024 at 1:37 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > On Mon, 11 Nov 2024 at 16:00, Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > I think that's a good idea for pre-content events, because it's fine > > > to say that if the sb/mount was not watched by a pre-content event listener > > > at the time of file open, then we do not care. > > > > Right. > > > > > The problem is that legacy inotify/fanotify watches can be added after > > > file is open, so that is allegedly why this optimization was not done for > > > fsnotify hooks in the past. > > > > So honestly, even if the legacy fsnotify hooks can't look at the file > > flag, they could damn well look at an inode flag. > > Legacy fanotify has a mount watch (FAN_MARK_MOUNT), > which is the common way for Anti-malware to set watches on > filesystems, so I am not sure what you are saying. > > > And I'm not even convinced that we couldn't fix them to just look at a > > file flag, and say "tough luck, somebody opened that file before you > > started watching, you don't get to see what they did". > > That would specifically break tail -f (for inotify) and probably many other > tools, but as long as we also look at the inode flags (i_fsnotify_mask) > and the dentry flags (DCACHE_FSNOTIFY_PARENT_WATCHED), > then I think we may be able to get away with changing the semantics > for open files on a fanotify mount watch. Yes, I agree we cannot afford to generate FS_MODIFY event only if the mark was placed after file open. There's too much stuff in userspace depending on this since this behavior dates back to inotify interface sometime in 2010 or so. > Specifically, I would really like to eliminate completely the cost of > FAN_ACCESS_PERM event, which could be gated on file flag, because > this is only for security/Anti-malware and I don't think this event is > practically > useful and it sure does not need to guarantee permission events to mount > watchers on already open files. For traditional fanotify permission events I agree generating them only if the mark was placed before open is likely fine but we'll have to try and see whether something breaks. For the new pre-content events I like the per-file flag as Linus suggested. That should indeed save us some cache misses in some fast paths. Honza
On Mon 11-11-24 16:37:30, Linus Torvalds wrote: > On Mon, 11 Nov 2024 at 16:00, Amir Goldstein <amir73il@gmail.com> wrote: > > > > I think that's a good idea for pre-content events, because it's fine > > to say that if the sb/mount was not watched by a pre-content event listener > > at the time of file open, then we do not care. > > Right. > > > The problem is that legacy inotify/fanotify watches can be added after > > file is open, so that is allegedly why this optimization was not done for > > fsnotify hooks in the past. > > So honestly, even if the legacy fsnotify hooks can't look at the file > flag, they could damn well look at an inode flag. > > And I'm not even convinced that we couldn't fix them to just look at a > file flag, and say "tough luck, somebody opened that file before you > started watching, you don't get to see what they did". > > So even if we don't look at a file->f_mode flag, the lergacy cases > should look at i_fsnotify_mask, and do that *first*. > > IOW, not do it like fsnotify_object_watched() does now, which is just > broken. Again, it looks at inode->i_sb->s_fsnotify_mask completely > pointlessly, but it also does it much too late - it gets called after > we've already called into the fsnotify() code and have messed up the > I$ etc. > > The "linode->i_sb->s_fsnotify_mask" is not only an extra indirection, > it should be very *literally* pointless. If some bit isn't set in > i_sb->s_fsnotify_mask, then there should be no way to set that bit in > inode->i_fsnotify_mask. So the only time we should access > i_sb->s_fsnotify_mask is when i_notify_mask gets *modified*, not when > it gets tested. Thanks for explanation but what you write here and below seems to me as if you are not aware of some features fanotify API provides (for many years). With fanotify you can place a mark not only on a file / directory but you can place it also on a mount point or a superblock. In that case any operation of appropriate type happening through that mount point or on that superblock should deliver the event to the notification group that placed the mark. So for example we check inode->i_sb->s_fsnotify_mask on open to be able to decide whether somebody asked to get notifications about *all* opens on that superblock and generate appropriate events. These features are there since day 1 of fanotify (at least for mountpoints, superblocks were added later) and quite some userspace depends on them for doing filesystem-wide event monitoring. We could somehow cache the fact that someone placed a mark on the superblock in the inode / struct file but I don't see how to keep such cache consistent with marks that are attached to the superblock without incurring the very same cache misses we are trying to avoid. I do like your idea of caching whether somebody wants the notification in struct file as that way we can avoid the cache misses for the new pre-content hooks and possibly in hooks for the traditional fanotify permission events. But I don't see how we could possibly avoid these cache misses for the classical notification events like FAN_OPEN, FAN_ACCESS, FAN_MODIFY, ... Honza
On Tue, Nov 12, 2024 at 2:55 PM Jan Kara <jack@suse.cz> wrote: > > On Tue 12-11-24 09:11:32, Amir Goldstein wrote: > > On Tue, Nov 12, 2024 at 1:37 AM Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > On Mon, 11 Nov 2024 at 16:00, Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > > I think that's a good idea for pre-content events, because it's fine > > > > to say that if the sb/mount was not watched by a pre-content event listener > > > > at the time of file open, then we do not care. > > > > > > Right. > > > > > > > The problem is that legacy inotify/fanotify watches can be added after > > > > file is open, so that is allegedly why this optimization was not done for > > > > fsnotify hooks in the past. > > > > > > So honestly, even if the legacy fsnotify hooks can't look at the file > > > flag, they could damn well look at an inode flag. > > > > Legacy fanotify has a mount watch (FAN_MARK_MOUNT), > > which is the common way for Anti-malware to set watches on > > filesystems, so I am not sure what you are saying. > > > > > And I'm not even convinced that we couldn't fix them to just look at a > > > file flag, and say "tough luck, somebody opened that file before you > > > started watching, you don't get to see what they did". > > > > That would specifically break tail -f (for inotify) and probably many other > > tools, but as long as we also look at the inode flags (i_fsnotify_mask) > > and the dentry flags (DCACHE_FSNOTIFY_PARENT_WATCHED), > > then I think we may be able to get away with changing the semantics > > for open files on a fanotify mount watch. > > Yes, I agree we cannot afford to generate FS_MODIFY event only if the mark > was placed after file open. There's too much stuff in userspace depending > on this since this behavior dates back to inotify interface sometime in > 2010 or so. > > > Specifically, I would really like to eliminate completely the cost of > > FAN_ACCESS_PERM event, which could be gated on file flag, because > > this is only for security/Anti-malware and I don't think this event is > > practically > > useful and it sure does not need to guarantee permission events to mount > > watchers on already open files. > > For traditional fanotify permission events I agree generating them only if > the mark was placed before open is likely fine but we'll have to try and > see whether something breaks. For the new pre-content events I like the > per-file flag as Linus suggested. That should indeed save us some cache > misses in some fast paths. FWIW, attached a patch that implements FMODE_NOTIFY_PERM I have asked Oliver to run his performance tests to see if we can observe an improvement with existing workloads, but is sure is going to be useful for pre-content events. For example, here is what the pre content helper looks like after I adapted Josef's patches to use the flag: static inline bool fsnotify_file_has_pre_content_watches(struct file *file) { if (!(file->f_mode & FMODE_NOTIFY_PERM)) return false; if (!(file_inode(file)->i_sb->s_iflags & SB_I_ALLOW_HSM)) return false; return fsnotify_file_object_watched(file, FSNOTIFY_PRE_CONTENT_EVENTS); } Thanks, Amir.
On Mon, Nov 11, 2024 at 03:22:06PM -0800, Linus Torvalds wrote: > On Mon, 11 Nov 2024 at 14:46, Josef Bacik <josef@toxicpanda.com> wrote: > > > > Did you see the patch that added the > > fsnotify_file_has_pre_content_watches() thing? > > No, because I had gotten to patch 6/11, and it added this open thing, > and there was no such thing in any of the patches before it. > > It looks like you added FSNOTIFY_PRE_CONTENT_EVENTS in 11/17. > > However, at no point does it look like you actually test it at open > time, so none of this seems to matter. > > As far as I can see, even at the end of the series, you will call the > fsnotify hook at open time even if there are no content watches on the > file. > > So apparently the fsnotify_file_has_pre_content_watches() is not > called when it should be, and when it *is* called, it's also doing > completely the wrong thing. > > Look, for basic operations THAT DON'T CARE, you now added a function > call to fsnotify_file_has_pre_content_watches(), that function call > looks at inode->i_sb->s_iflags (doing two D$ accesses that shouldn't > be done!), and then after that looks at the i_fsnotify_mask. > > THIS IS EXACTLY THE KIND OF GARBAGE I'M TALKING ABOUT. > > This code has been written by somebody who NEVER EVER looked at > profiles. You're following chains of pointers when you never should. > > Look, here's a very basic example of the kind of complete mis-design > I'm talking about: > > - we're doing a basic read() on a file that isn't being watched. > > - we want to maybe do read-ahead > > - the code does > > if (fsnotify_file_has_pre_content_watches(file)) > return fpin; > > to say that "don't do read-ahead". > > Fine, I understand the concept. But keep in mind that the common case > is presumably that there are no content watches. > > And even ignoring the "common case" issue, that's the one you want to > OPTIMIZE for. That's the case that matters for performance, because > clearly if there are content watches, you're going to go into "Go > Slow" mode anyway and not do pre-fetching. So even if content watches > are common on some load, they are clearly not the case you should do > performance optimization for. > > With me so far? > > So if THAT is the case that matters, then dammit, we shouldn't be > calling a function at all. > > And when calling the function, we shouldn't start out with this > completely broken logic: > > struct inode *inode = file_inode(file); > __u32 mnt_mask = real_mount(file->f_path.mnt)->mnt_fsnotify_mask; > > if (!(inode->i_sb->s_iflags & SB_I_ALLOW_HSM)) > return false; > > that does random crap and looks up some "mount mask" and looks up the > superblock flags. > > Why shouldn't we do this? > > BECAUSE NONE OF THIS MATTERS IF THE FILE HASN'T EVEN BEEN MARKED FOR > CONTENT MATCHES! > > See why I'm shouting? You're doing insane things, and you're doing > them for all the cases that DO NOT MATTER. You're doing all of this > for the common case that doesn't want to see that kind of mindless > overhead. > > You literally check for the "do I even care" *last*, when you finally > do that fsnotify_object_watched() check that looks at the inode. But > by then you have already wasted all that time and effort, and > fsnotify_object_watched() is broken anyway, because it's stupidly > designed to require that mnt_mask that isn't needed if you have > properly marked each object individually. > > So what *should* you have? > > You should have had a per-file flag saying "Do I need to even call > this crud at all", and have it in a location where you don't need to > look at anything else. > > And fsnotify already basically has that flag, except it's mis-designed > too. We have FMODE_NONOTIFY, which is the wrong way around (saying > "don't notify", when that should just be the *default*), and the > fsnotify layer uses it only to mark its own internal files so that it > doesn't get called recursively. So that flag that *looks* sane and is > in the right location is actually doing the wrong thing, because it's > dealing with a rare special case, not the important cases that > actually matter. > > So all of this readahead logic - and all of the read and write hooks - > should be behind a simple "oh, this file doesn't have any notification > stuff, so don't bother calling any fsnotify functions". > > So I think the pattern should be > > static inline bool fsnotify_file_has_pre_content_watches(struct file *file) > { > if (unlikely(file->f_mode & FMODE_NOTIFY)) > return out_of_line_crud(file); > return false; > } > > so that the *only* thing that gets inlined is "does this file have any > fsnotify stuff at all", and in the common case where that isn't true, > we don't call any fsnotify functions, and we don't start looking at > inodes or superblocks or anything pointless like that. > > THAT is the kind of code sequence we should have for unlikely cases. > Not the "call fsnotify code to check for something unlikely". Not > "look at file_inode(file)->i_sb->s_iflags" until it's *required*. > > Similarly, the actual open-time code SHOULD NEVER BE CALLED, because > it should have the same kind of simple logic based on some simple flag > either in the dentry or maybe in the inode. So that all those *normal* > dentries and inodes that don't have watches don't get the overhead of > calling into fsnotify code. > > Because yes, I do see all those function calls, and all those > unnecessary indirect pointer accesses when I do profiles. And if I see > fsnotify in my profiles when I don't have any notifications enabled, > that really really *annoys* me. > There are good suggestions here, and decent analysis, Amir has followed up with a patch that will improve things, and that's good. But this was an entirely inappropriate way to communicate your point, even with people who have been here a while and are used to being yelled at. Throughout this email you have suggested that myself, Amir, and Jan have never looked at profiles. You go so far as to suggest that we have no idea what we're doing and don't understand common case optimizations. These are the sort of comments that are unhelpful and put most people on the defensive, and make them unwilling to listen to your suggestions and feedback. These are the sort of comments that make people work very hard to exit this community. Are you wrong? No. We all get tunnel vision, I know I was deeply focused on making the page fault case have as little impact as possible, but I definitely didn't consider the indirect access case. I don't run with mitigations on, and frankly I am a file system person, I know if you're here we're going to do IO and that's going to be the bad part. That's why we do code review, because we're all human and we all miss things. But being dressed down like this for missing a better way, because lets be honest what I did wasn't earth shatteringly bad, is not helpful. It is actively harmful, and it makes me not want to work on this stuff anymore. We constantly have discussions about how do we bring in new people and how do we train up new maintainers, without looking at how do we keep maintainers and how do we keep developers. If you're losing people like me, gaining new people is going to be an even taller order. Thanks, Josef
On Tue, 12 Nov 2024 at 07:24, Josef Bacik <josef@toxicpanda.com> wrote: > > But this was an entirely inappropriate way to communicate your point, even with > people who have been here a while and are used to being yelled at. Fair enough. I was probably way too frustrated, because I spent some of the last few weeks literally trying to remove two cache misses from the permission checking path. And those two cache misses were for features that are in POSIX made worse because of uid translations infrastructure used for containers. Those are both things that people actually *use* in major ways. Admittedly the particular optimization was exactly to _not_ bother with looking up the exact uid details when we could tell early that it was all unnecessary, but the point was that this was to optimize something real, something important, and not some special case. And here's an example of a profile I've been looking at (this is a real load: the profile is literally my "make allmodconfig" build with not a lot of actual rebuilding needed): 1.30% avc_has_perm_noaudit 1.24% selinux_inode_permission 1.18% link_path_walk.part.0.constprop.0 0.99% btrfs_getattr 0.82% clear_page_rep 0.82% security_inode_permission 0.60% dput 0.60% path_lookupat 0.60% __check_object_size 0.54% strncpy_from_user 0.51% inode_permission 0.50% generic_permission 0.47% kmem_cache_alloc_noprof 0.47% step_into 0.46% btrfs_permission 0.45% cp_new_stat 0.44% filename_lookup and hey, you go "most of those are just around half a percent". Sure. But add those things up, and you'll see that they add up to about 12%. And yes, that 12% is 1/8th of the whole load. So it's not some "just 12% of the kernel footprint". It's literally 12% of one of the main loads I run day-to-day, which is why I care about these paths so deeply. And look at the two top offenders. No, they aren't fsnotify. But guess what they are? They are hooks in the VFS layer that the VFS code cannot do anything about and cannot optimize as a result. They do ABSOLUTELY NOTHING on this load, and they add no actual value. If they had some kind of "I don't have any special security label" test, the two top offenders (and down that list you can find a third one) would likely just not exist at all. But they aren't "real" VFS functions, they are just hooks into a black hole with random semantics that is set by user-supplied tables, so we don't have that. So this is why I'm putting my foot down. No more badly coded and badly designed hooks that the VFS layer can't just optimize away for the common case. I can't fix the existing issues. But I can say "no more". If you want new hooks, they had better have effectively *ZERO* overhead when they aren't relevant. And yes, I was too forceful. Sorry. But my foot stays down. If you want a specialty hook for specialty uses, that hook needs to be so *fundamentally* low-cost that I simply don't need to worry about it. It needs to have a flag that doesn't take a cache miss that says "Nobody cares", and doesn't call out to random pointless functions that cause I$ misses either. I really wish I had made that requirement for the security people all those years ago. Because 99% of the time, all that stupid selinux noise is literally worthless and does nothing. But we don't have the flag that says "nothing to see". Linus
diff --git a/fs/namei.c b/fs/namei.c index 4a4a22a08ac2..b49fb1f80c0c 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3782,7 +3782,15 @@ static int do_open(struct nameidata *nd, } if (do_truncate) mnt_drop_write(nd->path.mnt); - return error; + if (error) + return error; + + /* + * This permission hook is different than fsnotify_open_perm() hook. + * This is a pre-content hook that is called without sb_writers held + * and after the file was truncated. + */ + return fsnotify_file_area_perm(file, MAY_OPEN, &file->f_pos, 0); } /** diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h index 22150e5797c5..1e87a54b88b6 100644 --- a/include/linux/fsnotify.h +++ b/include/linux/fsnotify.h @@ -171,6 +171,8 @@ static inline int fsnotify_pre_content(const struct file *file, /* * fsnotify_file_area_perm - permission hook before access of file range + * + * Called post open with access range [0..0]. */ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask, const loff_t *ppos, size_t count) @@ -185,7 +187,7 @@ static inline int fsnotify_file_area_perm(struct file *file, int perm_mask, /* * read()/write and other types of access generate pre-content events. */ - if (perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS)) { + if (perm_mask & (MAY_READ | MAY_WRITE | MAY_ACCESS | MAY_OPEN)) { int ret = fsnotify_pre_content(file, ppos, count); if (ret)