diff mbox series

[RFC] fanotify: allow to set errno in FAN_DENY permission response

Message ID 20231208080135.4089880-1-amir73il@gmail.com (mailing list archive)
State New
Headers show
Series [RFC] fanotify: allow to set errno in FAN_DENY permission response | expand

Commit Message

Amir Goldstein Dec. 8, 2023, 8:01 a.m. UTC
With FAN_DENY response, user trying to perform the filesystem operation
gets an error with errno set to EPERM.

It is useful for hierarchical storage management (HSM) service to be able
to deny access for reasons more diverse than EPERM, for example EAGAIN,
if HSM could retry the operation later.

Allow userspace to response to permission events with the response value
FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error.

The change in fanotify_response is backward compatible, because errno is
written in the high 8 bits of the 32bit response field and old kernels
reject respose value with high bits set.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Jan,

I remember discussing this API change with you and I even remember that
you preferred to send the errno as an extra info record with the response,
but I could not find any private or public email about this.

So let that be my first public proposal of the FAN_DENY_ERRNO(errno) API.
I would like you to recondsider the inline errno proposal vs. an extra
errno info record, seeing how simple and fundumental the inline errno
response is.

I should mention that I was specifically asked about providing this
functionality by developers from Meta, because it is needed for their
use case.

What I am not sure about is whether we should allow FAN_DENY_ERRNO
response to any permission event, also the legacy ones, or we should
limit the FAN_DENY_ERRNO response to new HSM events or new HSM group
class (FAN_CLASS_PRE_PATH).

I was aiming at the latter option, but I can't realy think of a good
enough reason to deny this functionality from legacy permission events
as the way that the API is extended is fully backward compat.

If you also think that this could be useful for legacy permission
events and if the patch looks reasonable to you, then you may go
ahead and consider it for 6.8 already as it is independent from the
rest of the HSM event patches.

Thoughts?

Thanks,
Amir.


 fs/notify/fanotify/fanotify.c      |  7 +++++--
 fs/notify/fanotify/fanotify_user.c | 10 +++++++---
 include/linux/fanotify.h           |  9 ++++++++-
 include/uapi/linux/fanotify.h      |  7 +++++++
 4 files changed, 27 insertions(+), 6 deletions(-)

Comments

Jan Kara Dec. 13, 2023, 5:28 p.m. UTC | #1
On Fri 08-12-23 10:01:35, Amir Goldstein wrote:
> With FAN_DENY response, user trying to perform the filesystem operation
> gets an error with errno set to EPERM.
> 
> It is useful for hierarchical storage management (HSM) service to be able
> to deny access for reasons more diverse than EPERM, for example EAGAIN,
> if HSM could retry the operation later.
> 
> Allow userspace to response to permission events with the response value
> FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error.
> 
> The change in fanotify_response is backward compatible, because errno is
> written in the high 8 bits of the 32bit response field and old kernels
> reject respose value with high bits set.
> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

So a couple of comments that spring to my mind when I'm looking into this
now (partly maybe due to my weak memory ;):

1) Do we still need the EAGAIN return? I think we have mostly dealt with
freezing deadlocks in another way, didn't we?

2) If answer to 1) is yes, then there is a second question - do we expect
the errors to propagate back to the unsuspecting application doing say
read(2) syscall? Because I don't think that will fly well with a big
majority of applications which basically treat *any* error from read(2) as
fatal. This is also related to your question about standard permission
events. Consumers of these error numbers are going to be random
applications and I see a potential for rather big confusion arising there
(like read(1) returning EINVAL or EBADF and now you wonder why the hell
until you go debug the kernel and find out the error is coming out of
fanotify handler). And the usecase is not quite clear to me for ordinary
fanotify permission events (while I have no doubts about creativity of
implementors of fanotify handlers ;)).

3) Given the potential for confusion, maybe we should stay conservative and
only allow additional EAGAIN error instead of arbitrary errno if we need it?

I'm leaving the API question aside for a moment until I have a clearer
picture of what we actually want to implement :).

								Honza
Amir Goldstein Dec. 13, 2023, 7:09 p.m. UTC | #2
On Wed, Dec 13, 2023 at 7:28 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 08-12-23 10:01:35, Amir Goldstein wrote:
> > With FAN_DENY response, user trying to perform the filesystem operation
> > gets an error with errno set to EPERM.
> >
> > It is useful for hierarchical storage management (HSM) service to be able
> > to deny access for reasons more diverse than EPERM, for example EAGAIN,
> > if HSM could retry the operation later.
> >
> > Allow userspace to response to permission events with the response value
> > FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error.
> >
> > The change in fanotify_response is backward compatible, because errno is
> > written in the high 8 bits of the 32bit response field and old kernels
> > reject respose value with high bits set.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> So a couple of comments that spring to my mind when I'm looking into this
> now (partly maybe due to my weak memory ;):
>
> 1) Do we still need the EAGAIN return? I think we have mostly dealt with
> freezing deadlocks in another way, didn't we?

I was thinking about EAGAIN on account of the HSM not being able to
download the file ATM.

There are a bunch of error codes that are typical for network filesystems, e.g.
ETIMEDOUT, ENOTCONN, ECONNRESET which could be relevant to
HSM failures.

>
> 2) If answer to 1) is yes, then there is a second question - do we expect
> the errors to propagate back to the unsuspecting application doing say
> read(2) syscall? Because I don't think that will fly well with a big
> majority of applications which basically treat *any* error from read(2) as
> fatal. This is also related to your question about standard permission
> events. Consumers of these error numbers are going to be random
> applications and I see a potential for rather big confusion arising there
> (like read(1) returning EINVAL or EBADF and now you wonder why the hell
> until you go debug the kernel and find out the error is coming out of
> fanotify handler). And the usecase is not quite clear to me for ordinary
> fanotify permission events (while I have no doubts about creativity of
> implementors of fanotify handlers ;)).
>

That's a good question.
I prefer to delegate your question to the prospect users of the feature.

Josef, which errors did your use case need this feature for?

> 3) Given the potential for confusion, maybe we should stay conservative and
> only allow additional EAGAIN error instead of arbitrary errno if we need it?
>

I know I was planning to use this for EDQUOT error (from FAN_PRE_MODIFY),
but I certainly wouldn't mind restricting the set of custom errors.
I think it makes sense. The hard part is to agree on this set of errors.

> I'm leaving the API question aside for a moment until I have a clearer
> picture of what we actually want to implement :).

Fair enough ;)

Thanks,
Amir.
Josef Bacik Dec. 15, 2023, 3:31 p.m. UTC | #3
On Wed, Dec 13, 2023 at 09:09:30PM +0200, Amir Goldstein wrote:
> On Wed, Dec 13, 2023 at 7:28 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 08-12-23 10:01:35, Amir Goldstein wrote:
> > > With FAN_DENY response, user trying to perform the filesystem operation
> > > gets an error with errno set to EPERM.
> > >
> > > It is useful for hierarchical storage management (HSM) service to be able
> > > to deny access for reasons more diverse than EPERM, for example EAGAIN,
> > > if HSM could retry the operation later.
> > >
> > > Allow userspace to response to permission events with the response value
> > > FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error.
> > >
> > > The change in fanotify_response is backward compatible, because errno is
> > > written in the high 8 bits of the 32bit response field and old kernels
> > > reject respose value with high bits set.
> > >
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >
> > So a couple of comments that spring to my mind when I'm looking into this
> > now (partly maybe due to my weak memory ;):
> >
> > 1) Do we still need the EAGAIN return? I think we have mostly dealt with
> > freezing deadlocks in another way, didn't we?
> 
> I was thinking about EAGAIN on account of the HSM not being able to
> download the file ATM.
> 
> There are a bunch of error codes that are typical for network filesystems, e.g.
> ETIMEDOUT, ENOTCONN, ECONNRESET which could be relevant to
> HSM failures.
> 
> >
> > 2) If answer to 1) is yes, then there is a second question - do we expect
> > the errors to propagate back to the unsuspecting application doing say
> > read(2) syscall? Because I don't think that will fly well with a big
> > majority of applications which basically treat *any* error from read(2) as
> > fatal. This is also related to your question about standard permission
> > events. Consumers of these error numbers are going to be random
> > applications and I see a potential for rather big confusion arising there
> > (like read(1) returning EINVAL or EBADF and now you wonder why the hell
> > until you go debug the kernel and find out the error is coming out of
> > fanotify handler). And the usecase is not quite clear to me for ordinary
> > fanotify permission events (while I have no doubts about creativity of
> > implementors of fanotify handlers ;)).
> >
> 
> That's a good question.
> I prefer to delegate your question to the prospect users of the feature.
> 
> Josef, which errors did your use case need this feature for?
> 
> > 3) Given the potential for confusion, maybe we should stay conservative and
> > only allow additional EAGAIN error instead of arbitrary errno if we need it?
> >
> 
> I know I was planning to use this for EDQUOT error (from FAN_PRE_MODIFY),
> but I certainly wouldn't mind restricting the set of custom errors.
> I think it makes sense. The hard part is to agree on this set of errors.
> 

I'm all for flexibility here.

We're going to have 2 classes of applications interacting with HSM backed
storage, normal applications and applications that know they're backed by HSM.
The normal applications are just going to crash if they get an error on read(2),
it doesn't matter what errno it is.  The second class would have different
things they'd want to do in the face of different errors, and that's what this
patchset is targeting.  We can limit it to a few errno's if that makes you feel
better, but having more than just one would be helpful.  Thanks,

Josef
Amir Goldstein Dec. 15, 2023, 4:50 p.m. UTC | #4
On Fri, Dec 15, 2023 at 5:31 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On Wed, Dec 13, 2023 at 09:09:30PM +0200, Amir Goldstein wrote:
> > On Wed, Dec 13, 2023 at 7:28 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Fri 08-12-23 10:01:35, Amir Goldstein wrote:
> > > > With FAN_DENY response, user trying to perform the filesystem operation
> > > > gets an error with errno set to EPERM.
> > > >
> > > > It is useful for hierarchical storage management (HSM) service to be able
> > > > to deny access for reasons more diverse than EPERM, for example EAGAIN,
> > > > if HSM could retry the operation later.
> > > >
> > > > Allow userspace to response to permission events with the response value
> > > > FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error.
> > > >
> > > > The change in fanotify_response is backward compatible, because errno is
> > > > written in the high 8 bits of the 32bit response field and old kernels
> > > > reject respose value with high bits set.
> > > >
> > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > >
> > > So a couple of comments that spring to my mind when I'm looking into this
> > > now (partly maybe due to my weak memory ;):
> > >
> > > 1) Do we still need the EAGAIN return? I think we have mostly dealt with
> > > freezing deadlocks in another way, didn't we?
> >
> > I was thinking about EAGAIN on account of the HSM not being able to
> > download the file ATM.
> >
> > There are a bunch of error codes that are typical for network filesystems, e.g.
> > ETIMEDOUT, ENOTCONN, ECONNRESET which could be relevant to
> > HSM failures.
> >
> > >
> > > 2) If answer to 1) is yes, then there is a second question - do we expect
> > > the errors to propagate back to the unsuspecting application doing say
> > > read(2) syscall? Because I don't think that will fly well with a big
> > > majority of applications which basically treat *any* error from read(2) as
> > > fatal. This is also related to your question about standard permission
> > > events. Consumers of these error numbers are going to be random
> > > applications and I see a potential for rather big confusion arising there
> > > (like read(1) returning EINVAL or EBADF and now you wonder why the hell
> > > until you go debug the kernel and find out the error is coming out of
> > > fanotify handler). And the usecase is not quite clear to me for ordinary
> > > fanotify permission events (while I have no doubts about creativity of
> > > implementors of fanotify handlers ;)).
> > >
> >
> > That's a good question.
> > I prefer to delegate your question to the prospect users of the feature.
> >
> > Josef, which errors did your use case need this feature for?
> >
> > > 3) Given the potential for confusion, maybe we should stay conservative and
> > > only allow additional EAGAIN error instead of arbitrary errno if we need it?
> > >
> >
> > I know I was planning to use this for EDQUOT error (from FAN_PRE_MODIFY),
> > but I certainly wouldn't mind restricting the set of custom errors.
> > I think it makes sense. The hard part is to agree on this set of errors.
> >
>
> I'm all for flexibility here.
>
> We're going to have 2 classes of applications interacting with HSM backed
> storage, normal applications and applications that know they're backed by HSM.
> The normal applications are just going to crash if they get an error on read(2),
> it doesn't matter what errno it is.  The second class would have different
> things they'd want to do in the face of different errors, and that's what this
> patchset is targeting.  We can limit it to a few errno's if that makes you feel
> better, but having more than just one would be helpful.
>

Ok. In another email I got from your colleagues, they listed:
EIO, EAGAIN, ENOSPC as possible errors to return.
I added EDQUOT for our in house use case.

Those are all valid errors for write(2) and some are valid for read(2).
ENOSPC/EDQUOT make a lot of sense for HSM for read(2), but could
be surprising to applications not aware of HSM.
I think it wouldn't be that bad if we let HSM decide which of those errors
to return for FAN_PRE_ACCESS as opposed to FAN_PRE_MODIFY.

But given that we do want to limit the chance of abusing this feature,
perhaps it would be best to limit the error codes to known error codes
for write(2) IO failures (i.e. not EBADF, not EFAULT) and allow returning
FAN_DENY_ERRNO only for the new FAN_PRE_{ACCESS,MODIFY}
HSM events.

IOW, FAN_{OPEN,ACCESS}_PERM - no FAN_DENY_ERRNO for you!

Does that sound good to you?

Furthermore, we can start with allowing a very limited set of errors
and extend it in the future, on case by case basis.

The way that this could be manageable is if we provide userspace
a way to test for supported return codes.

There is already a simple method that we used for testing FAN_INFO
records type support -
After fan_fd = fanotify_init(), userspace can write a "test" fanotify_response
to fan_fd with fanotify_response.fd=FAN_NOFD.

When setting fanotify_response.fd=FAN_DENY, this would return ENOENT,
but with fanotify_response.fd=FAN_DENY_ERRNO(EIO), upstream would
return EINVAL.

This opens the possibility of allowing, say, EIO, EAGAIN in the first release
and ENOSPC, EDQUOT in the following release.

The advantage in this method is that it is very simple and already working
correctly for old kernels.

The downside is that this simple method does not allow checking for allowed
errors per specific event type, so if we decide that we do want to
allow returning
FAN_DENY_ERRNO for FAN_OPEN_PERM later on, this method could not
be used by userspace to test for this finer grained support.

In another thread, I mention the fact that FAN_OPEN_PERM still has a
potential freeze deadlock when called from open(O_TRUNC|O_CREATE),
so we can consider the fact that FAN_DENY_ERRNO is not allowed with
FAN_OPEN_PERM as a negative incentive for people to consider using
FAN_OPEN_PERM as a trigger for HSM.

Thoughts?

Thanks,
Amir.
Jan Kara Dec. 18, 2023, 2:35 p.m. UTC | #5
On Fri 15-12-23 18:50:39, Amir Goldstein wrote:
> On Fri, Dec 15, 2023 at 5:31 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Wed, Dec 13, 2023 at 09:09:30PM +0200, Amir Goldstein wrote:
> > > On Wed, Dec 13, 2023 at 7:28 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Fri 08-12-23 10:01:35, Amir Goldstein wrote:
> > > > > With FAN_DENY response, user trying to perform the filesystem operation
> > > > > gets an error with errno set to EPERM.
> > > > >
> > > > > It is useful for hierarchical storage management (HSM) service to be able
> > > > > to deny access for reasons more diverse than EPERM, for example EAGAIN,
> > > > > if HSM could retry the operation later.
> > > > >
> > > > > Allow userspace to response to permission events with the response value
> > > > > FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error.
> > > > >
> > > > > The change in fanotify_response is backward compatible, because errno is
> > > > > written in the high 8 bits of the 32bit response field and old kernels
> > > > > reject respose value with high bits set.
> > > > >
> > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > >
> > > > So a couple of comments that spring to my mind when I'm looking into this
> > > > now (partly maybe due to my weak memory ;):
> > > >
> > > > 1) Do we still need the EAGAIN return? I think we have mostly dealt with
> > > > freezing deadlocks in another way, didn't we?
> > >
> > > I was thinking about EAGAIN on account of the HSM not being able to
> > > download the file ATM.
> > >
> > > There are a bunch of error codes that are typical for network filesystems, e.g.
> > > ETIMEDOUT, ENOTCONN, ECONNRESET which could be relevant to
> > > HSM failures.
> > >
> > > >
> > > > 2) If answer to 1) is yes, then there is a second question - do we expect
> > > > the errors to propagate back to the unsuspecting application doing say
> > > > read(2) syscall? Because I don't think that will fly well with a big
> > > > majority of applications which basically treat *any* error from read(2) as
> > > > fatal. This is also related to your question about standard permission
> > > > events. Consumers of these error numbers are going to be random
> > > > applications and I see a potential for rather big confusion arising there
> > > > (like read(1) returning EINVAL or EBADF and now you wonder why the hell
> > > > until you go debug the kernel and find out the error is coming out of
> > > > fanotify handler). And the usecase is not quite clear to me for ordinary
> > > > fanotify permission events (while I have no doubts about creativity of
> > > > implementors of fanotify handlers ;)).
> > > >
> > >
> > > That's a good question.
> > > I prefer to delegate your question to the prospect users of the feature.
> > >
> > > Josef, which errors did your use case need this feature for?
> > >
> > > > 3) Given the potential for confusion, maybe we should stay conservative and
> > > > only allow additional EAGAIN error instead of arbitrary errno if we need it?
> > > >
> > >
> > > I know I was planning to use this for EDQUOT error (from FAN_PRE_MODIFY),
> > > but I certainly wouldn't mind restricting the set of custom errors.
> > > I think it makes sense. The hard part is to agree on this set of errors.
> > >
> >
> > I'm all for flexibility here.
> >
> > We're going to have 2 classes of applications interacting with HSM backed
> > storage, normal applications and applications that know they're backed by HSM.
> > The normal applications are just going to crash if they get an error on read(2),
> > it doesn't matter what errno it is.  The second class would have different
> > things they'd want to do in the face of different errors, and that's what this
> > patchset is targeting.  We can limit it to a few errno's if that makes you feel
> > better, but having more than just one would be helpful.
> 
> Ok. In another email I got from your colleagues, they listed:
> EIO, EAGAIN, ENOSPC as possible errors to return.
> I added EDQUOT for our in house use case.

OK, so do I get it right that you also have applications that are aware
that they are operation on top of HSM managed filesystem and thus they can
do meaningful things with the reported errors?

> Those are all valid errors for write(2) and some are valid for read(2).
> ENOSPC/EDQUOT make a lot of sense for HSM for read(2), but could
> be surprising to applications not aware of HSM.
> I think it wouldn't be that bad if we let HSM decide which of those errors
> to return for FAN_PRE_ACCESS as opposed to FAN_PRE_MODIFY.

Yeah, I don't think we need to be super-restrictive here, I'd just prefer
to avoid the "whatever number you decide to return" kind of interface
because I can see potential for confusion and abuse there. I think all four
errors above are perfectly fine for both FAN_PRE_ACCESS and FAN_PRE_MODIFY
if there are consumers that are able to use them.
 
> But given that we do want to limit the chance of abusing this feature,
> perhaps it would be best to limit the error codes to known error codes
> for write(2) IO failures (i.e. not EBADF, not EFAULT) and allow returning
> FAN_DENY_ERRNO only for the new FAN_PRE_{ACCESS,MODIFY}
> HSM events.
> 
> IOW, FAN_{OPEN,ACCESS}_PERM - no FAN_DENY_ERRNO for you!
> 
> Does that sound good to you?

It sounds OK to me. I'm open to allowing FAN_DENY_ERRNO for FAN_OPEN_PERM
if there's a usecase because at least conceptually it makes a good sense
and chances for confusion are low there. People are used to dealing with
errors on open(2).

> Furthermore, we can start with allowing a very limited set of errors
> and extend it in the future, on case by case basis.
> 
> The way that this could be manageable is if we provide userspace
> a way to test for supported return codes.
> 
> There is already a simple method that we used for testing FAN_INFO
> records type support -
> After fan_fd = fanotify_init(), userspace can write a "test" fanotify_response
> to fan_fd with fanotify_response.fd=FAN_NOFD.
> 
> When setting fanotify_response.fd=FAN_DENY, this would return ENOENT,
> but with fanotify_response.fd=FAN_DENY_ERRNO(EIO), upstream would
> return EINVAL.
> 
> This opens the possibility of allowing, say, EIO, EAGAIN in the first release
> and ENOSPC, EDQUOT in the following release.

If we forsee that ENOSPC and EDQUOT will be needed, then we can just enable
it from start and not complicate our lives more than necessary.

> The advantage in this method is that it is very simple and already working
> correctly for old kernels.
> 
> The downside is that this simple method does not allow checking for
> allowed errors per specific event type, so if we decide that we do want
> to allow returning FAN_DENY_ERRNO for FAN_OPEN_PERM later on, this method
> could not be used by userspace to test for this finer grained support.

True, in that case the HSM manager would have to try responding with
FAN_DENY_ERRNO() and if it fails, it will have to fallback to responding
with FAN_DENY. Not too bad I'd say.

> In another thread, I mention the fact that FAN_OPEN_PERM still has a
> potential freeze deadlock when called from open(O_TRUNC|O_CREATE),
> so we can consider the fact that FAN_DENY_ERRNO is not allowed with
> FAN_OPEN_PERM as a negative incentive for people to consider using
> FAN_OPEN_PERM as a trigger for HSM.

AFAIU from the past discussions, there's no good use of FAN_OPEN_PERM
event for HSM. If that's the case, I'm for not allowing FAN_DENY_ERRNO for
FAN_OPEN_PERM.

								Honza
Amir Goldstein Dec. 18, 2023, 3:53 p.m. UTC | #6
On Mon, Dec 18, 2023 at 4:35 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 15-12-23 18:50:39, Amir Goldstein wrote:
> > On Fri, Dec 15, 2023 at 5:31 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > >
> > > On Wed, Dec 13, 2023 at 09:09:30PM +0200, Amir Goldstein wrote:
> > > > On Wed, Dec 13, 2023 at 7:28 PM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Fri 08-12-23 10:01:35, Amir Goldstein wrote:
> > > > > > With FAN_DENY response, user trying to perform the filesystem operation
> > > > > > gets an error with errno set to EPERM.
> > > > > >
> > > > > > It is useful for hierarchical storage management (HSM) service to be able
> > > > > > to deny access for reasons more diverse than EPERM, for example EAGAIN,
> > > > > > if HSM could retry the operation later.
> > > > > >
> > > > > > Allow userspace to response to permission events with the response value
> > > > > > FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error.
> > > > > >
> > > > > > The change in fanotify_response is backward compatible, because errno is
> > > > > > written in the high 8 bits of the 32bit response field and old kernels
> > > > > > reject respose value with high bits set.
> > > > > >
> > > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > > >
> > > > > So a couple of comments that spring to my mind when I'm looking into this
> > > > > now (partly maybe due to my weak memory ;):
> > > > >
> > > > > 1) Do we still need the EAGAIN return? I think we have mostly dealt with
> > > > > freezing deadlocks in another way, didn't we?
> > > >
> > > > I was thinking about EAGAIN on account of the HSM not being able to
> > > > download the file ATM.
> > > >
> > > > There are a bunch of error codes that are typical for network filesystems, e.g.
> > > > ETIMEDOUT, ENOTCONN, ECONNRESET which could be relevant to
> > > > HSM failures.
> > > >
> > > > >
> > > > > 2) If answer to 1) is yes, then there is a second question - do we expect
> > > > > the errors to propagate back to the unsuspecting application doing say
> > > > > read(2) syscall? Because I don't think that will fly well with a big
> > > > > majority of applications which basically treat *any* error from read(2) as
> > > > > fatal. This is also related to your question about standard permission
> > > > > events. Consumers of these error numbers are going to be random
> > > > > applications and I see a potential for rather big confusion arising there
> > > > > (like read(1) returning EINVAL or EBADF and now you wonder why the hell
> > > > > until you go debug the kernel and find out the error is coming out of
> > > > > fanotify handler). And the usecase is not quite clear to me for ordinary
> > > > > fanotify permission events (while I have no doubts about creativity of
> > > > > implementors of fanotify handlers ;)).
> > > > >
> > > >
> > > > That's a good question.
> > > > I prefer to delegate your question to the prospect users of the feature.
> > > >
> > > > Josef, which errors did your use case need this feature for?
> > > >
> > > > > 3) Given the potential for confusion, maybe we should stay conservative and
> > > > > only allow additional EAGAIN error instead of arbitrary errno if we need it?
> > > > >
> > > >
> > > > I know I was planning to use this for EDQUOT error (from FAN_PRE_MODIFY),
> > > > but I certainly wouldn't mind restricting the set of custom errors.
> > > > I think it makes sense. The hard part is to agree on this set of errors.
> > > >
> > >
> > > I'm all for flexibility here.
> > >
> > > We're going to have 2 classes of applications interacting with HSM backed
> > > storage, normal applications and applications that know they're backed by HSM.
> > > The normal applications are just going to crash if they get an error on read(2),
> > > it doesn't matter what errno it is.  The second class would have different
> > > things they'd want to do in the face of different errors, and that's what this
> > > patchset is targeting.  We can limit it to a few errno's if that makes you feel
> > > better, but having more than just one would be helpful.
> >
> > Ok. In another email I got from your colleagues, they listed:
> > EIO, EAGAIN, ENOSPC as possible errors to return.
> > I added EDQUOT for our in house use case.
>
> OK, so do I get it right that you also have applications that are aware
> that they are operation on top of HSM managed filesystem and thus they can
> do meaningful things with the reported errors?
>

Some applications are HSM aware.
Some just report the errors that they get which are meaningful to users.
EIO is the standard response for HSM failure to fill content.

EDQUOT/ENOSPC is a good example of special functionality.
HSM "swaps out" file content to a slow remote tier, but the slow remote
tier may have a space/quota limit that is known to HSM.

By tracking the total of st_size under some HSM managed folder, including
the st_size of files whose content is punched out, HSM can enforce this limit
not in the conventional meaning of local disk blocks usage.

This is when returning EDQUOT/ENOSPC for FAN_PRE_MODIFY makes
sense to most users/applications, except for ones that try to create
large sparse
files...


> > Those are all valid errors for write(2) and some are valid for read(2).
> > ENOSPC/EDQUOT make a lot of sense for HSM for read(2), but could
> > be surprising to applications not aware of HSM.
> > I think it wouldn't be that bad if we let HSM decide which of those errors
> > to return for FAN_PRE_ACCESS as opposed to FAN_PRE_MODIFY.
>
> Yeah, I don't think we need to be super-restrictive here, I'd just prefer
> to avoid the "whatever number you decide to return" kind of interface
> because I can see potential for confusion and abuse there. I think all four
> errors above are perfectly fine for both FAN_PRE_ACCESS and FAN_PRE_MODIFY
> if there are consumers that are able to use them.
>
> > But given that we do want to limit the chance of abusing this feature,
> > perhaps it would be best to limit the error codes to known error codes
> > for write(2) IO failures (i.e. not EBADF, not EFAULT) and allow returning
> > FAN_DENY_ERRNO only for the new FAN_PRE_{ACCESS,MODIFY}
> > HSM events.
> >
> > IOW, FAN_{OPEN,ACCESS}_PERM - no FAN_DENY_ERRNO for you!
> >
> > Does that sound good to you?
>
> It sounds OK to me. I'm open to allowing FAN_DENY_ERRNO for FAN_OPEN_PERM
> if there's a usecase because at least conceptually it makes a good sense
> and chances for confusion are low there. People are used to dealing with
> errors on open(2).
>

I wrote about one case I have below.

> > Furthermore, we can start with allowing a very limited set of errors
> > and extend it in the future, on case by case basis.
> >
> > The way that this could be manageable is if we provide userspace
> > a way to test for supported return codes.
> >
> > There is already a simple method that we used for testing FAN_INFO
> > records type support -
> > After fan_fd = fanotify_init(), userspace can write a "test" fanotify_response
> > to fan_fd with fanotify_response.fd=FAN_NOFD.
> >
> > When setting fanotify_response.fd=FAN_DENY, this would return ENOENT,
> > but with fanotify_response.fd=FAN_DENY_ERRNO(EIO), upstream would
> > return EINVAL.
> >
> > This opens the possibility of allowing, say, EIO, EAGAIN in the first release
> > and ENOSPC, EDQUOT in the following release.
>
> If we forsee that ENOSPC and EDQUOT will be needed, then we can just enable
> it from start and not complicate our lives more than necessary.
>

Sure, I was just giving an example how the list could be extended case by case
in the future.

> > The advantage in this method is that it is very simple and already working
> > correctly for old kernels.
> >
> > The downside is that this simple method does not allow checking for
> > allowed errors per specific event type, so if we decide that we do want
> > to allow returning FAN_DENY_ERRNO for FAN_OPEN_PERM later on, this method
> > could not be used by userspace to test for this finer grained support.
>
> True, in that case the HSM manager would have to try responding with
> FAN_DENY_ERRNO() and if it fails, it will have to fallback to responding
> with FAN_DENY. Not too bad I'd say.
>

Yeah that works too.

> > In another thread, I mention the fact that FAN_OPEN_PERM still has a
> > potential freeze deadlock when called from open(O_TRUNC|O_CREATE),
> > so we can consider the fact that FAN_DENY_ERRNO is not allowed with
> > FAN_OPEN_PERM as a negative incentive for people to consider using
> > FAN_OPEN_PERM as a trigger for HSM.
>
> AFAIU from the past discussions, there's no good use of FAN_OPEN_PERM
> event for HSM. If that's the case, I'm for not allowing FAN_DENY_ERRNO for
> FAN_OPEN_PERM.

In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark
to deny open of file during the short time that it's content is being
punched out [1].
It is quite complicated to explain, but I only used it for denying access,
not to fill content and not to write anything to filesystem.
It's worth noting that returning EBUSY in that case would be more meaningful
to users.

That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM,
but mainly I do not have a proof that people will not need it.

OTOH, I am a bit concerned that this will encourage developer to use
FAN_OPEN_PERM as a trigger to filling file content and then we are back to
deadlock risk zone.

Not sure which way to go.

Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO
before FAN_PRE_* events, so we can continue this discussion later when
I post FAN_PRE_* patches - not for this cycle.

Thanks,
Amir.

[1] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#invalidating-local-cache
Amir Goldstein Jan. 29, 2024, 6:30 p.m. UTC | #7
On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Dec 18, 2023 at 4:35 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 15-12-23 18:50:39, Amir Goldstein wrote:
> > > On Fri, Dec 15, 2023 at 5:31 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > > >
> > > > On Wed, Dec 13, 2023 at 09:09:30PM +0200, Amir Goldstein wrote:
> > > > > On Wed, Dec 13, 2023 at 7:28 PM Jan Kara <jack@suse.cz> wrote:
> > > > > >
> > > > > > On Fri 08-12-23 10:01:35, Amir Goldstein wrote:
> > > > > > > With FAN_DENY response, user trying to perform the filesystem operation
> > > > > > > gets an error with errno set to EPERM.
> > > > > > >
> > > > > > > It is useful for hierarchical storage management (HSM) service to be able
> > > > > > > to deny access for reasons more diverse than EPERM, for example EAGAIN,
> > > > > > > if HSM could retry the operation later.
> > > > > > >
> > > > > > > Allow userspace to response to permission events with the response value
> > > > > > > FAN_DENY_ERRNO(errno), instead of FAN_DENY to return a custom error.
> > > > > > >
> > > > > > > The change in fanotify_response is backward compatible, because errno is
> > > > > > > written in the high 8 bits of the 32bit response field and old kernels
> > > > > > > reject respose value with high bits set.
> > > > > > >
> > > > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > > > >
> > > > > > So a couple of comments that spring to my mind when I'm looking into this
> > > > > > now (partly maybe due to my weak memory ;):
> > > > > >
> > > > > > 1) Do we still need the EAGAIN return? I think we have mostly dealt with
> > > > > > freezing deadlocks in another way, didn't we?
> > > > >
> > > > > I was thinking about EAGAIN on account of the HSM not being able to
> > > > > download the file ATM.
> > > > >
> > > > > There are a bunch of error codes that are typical for network filesystems, e.g.
> > > > > ETIMEDOUT, ENOTCONN, ECONNRESET which could be relevant to
> > > > > HSM failures.
> > > > >
> > > > > >
> > > > > > 2) If answer to 1) is yes, then there is a second question - do we expect
> > > > > > the errors to propagate back to the unsuspecting application doing say
> > > > > > read(2) syscall? Because I don't think that will fly well with a big
> > > > > > majority of applications which basically treat *any* error from read(2) as
> > > > > > fatal. This is also related to your question about standard permission
> > > > > > events. Consumers of these error numbers are going to be random
> > > > > > applications and I see a potential for rather big confusion arising there
> > > > > > (like read(1) returning EINVAL or EBADF and now you wonder why the hell
> > > > > > until you go debug the kernel and find out the error is coming out of
> > > > > > fanotify handler). And the usecase is not quite clear to me for ordinary
> > > > > > fanotify permission events (while I have no doubts about creativity of
> > > > > > implementors of fanotify handlers ;)).
> > > > > >
> > > > >
> > > > > That's a good question.
> > > > > I prefer to delegate your question to the prospect users of the feature.
> > > > >
> > > > > Josef, which errors did your use case need this feature for?
> > > > >
> > > > > > 3) Given the potential for confusion, maybe we should stay conservative and
> > > > > > only allow additional EAGAIN error instead of arbitrary errno if we need it?
> > > > > >
> > > > >
> > > > > I know I was planning to use this for EDQUOT error (from FAN_PRE_MODIFY),
> > > > > but I certainly wouldn't mind restricting the set of custom errors.
> > > > > I think it makes sense. The hard part is to agree on this set of errors.
> > > > >
> > > >
> > > > I'm all for flexibility here.
> > > >
> > > > We're going to have 2 classes of applications interacting with HSM backed
> > > > storage, normal applications and applications that know they're backed by HSM.
> > > > The normal applications are just going to crash if they get an error on read(2),
> > > > it doesn't matter what errno it is.  The second class would have different
> > > > things they'd want to do in the face of different errors, and that's what this
> > > > patchset is targeting.  We can limit it to a few errno's if that makes you feel
> > > > better, but having more than just one would be helpful.
> > >
> > > Ok. In another email I got from your colleagues, they listed:
> > > EIO, EAGAIN, ENOSPC as possible errors to return.
> > > I added EDQUOT for our in house use case.
> >
> > OK, so do I get it right that you also have applications that are aware
> > that they are operation on top of HSM managed filesystem and thus they can
> > do meaningful things with the reported errors?
> >
>
> Some applications are HSM aware.
> Some just report the errors that they get which are meaningful to users.
> EIO is the standard response for HSM failure to fill content.
>
> EDQUOT/ENOSPC is a good example of special functionality.
> HSM "swaps out" file content to a slow remote tier, but the slow remote
> tier may have a space/quota limit that is known to HSM.
>
> By tracking the total of st_size under some HSM managed folder, including
> the st_size of files whose content is punched out, HSM can enforce this limit
> not in the conventional meaning of local disk blocks usage.
>
> This is when returning EDQUOT/ENOSPC for FAN_PRE_MODIFY makes
> sense to most users/applications, except for ones that try to create
> large sparse
> files...
>
>
> > > Those are all valid errors for write(2) and some are valid for read(2).
> > > ENOSPC/EDQUOT make a lot of sense for HSM for read(2), but could
> > > be surprising to applications not aware of HSM.
> > > I think it wouldn't be that bad if we let HSM decide which of those errors
> > > to return for FAN_PRE_ACCESS as opposed to FAN_PRE_MODIFY.
> >
> > Yeah, I don't think we need to be super-restrictive here, I'd just prefer
> > to avoid the "whatever number you decide to return" kind of interface
> > because I can see potential for confusion and abuse there. I think all four
> > errors above are perfectly fine for both FAN_PRE_ACCESS and FAN_PRE_MODIFY
> > if there are consumers that are able to use them.
> >
> > > But given that we do want to limit the chance of abusing this feature,
> > > perhaps it would be best to limit the error codes to known error codes
> > > for write(2) IO failures (i.e. not EBADF, not EFAULT) and allow returning
> > > FAN_DENY_ERRNO only for the new FAN_PRE_{ACCESS,MODIFY}
> > > HSM events.
> > >
> > > IOW, FAN_{OPEN,ACCESS}_PERM - no FAN_DENY_ERRNO for you!
> > >
> > > Does that sound good to you?
> >
> > It sounds OK to me. I'm open to allowing FAN_DENY_ERRNO for FAN_OPEN_PERM
> > if there's a usecase because at least conceptually it makes a good sense
> > and chances for confusion are low there. People are used to dealing with
> > errors on open(2).
> >
>
> I wrote about one case I have below.
>
> > > Furthermore, we can start with allowing a very limited set of errors
> > > and extend it in the future, on case by case basis.
> > >
> > > The way that this could be manageable is if we provide userspace
> > > a way to test for supported return codes.
> > >
> > > There is already a simple method that we used for testing FAN_INFO
> > > records type support -
> > > After fan_fd = fanotify_init(), userspace can write a "test" fanotify_response
> > > to fan_fd with fanotify_response.fd=FAN_NOFD.
> > >
> > > When setting fanotify_response.fd=FAN_DENY, this would return ENOENT,
> > > but with fanotify_response.fd=FAN_DENY_ERRNO(EIO), upstream would
> > > return EINVAL.
> > >
> > > This opens the possibility of allowing, say, EIO, EAGAIN in the first release
> > > and ENOSPC, EDQUOT in the following release.
> >
> > If we forsee that ENOSPC and EDQUOT will be needed, then we can just enable
> > it from start and not complicate our lives more than necessary.
> >
>
> Sure, I was just giving an example how the list could be extended case by case
> in the future.
>
> > > The advantage in this method is that it is very simple and already working
> > > correctly for old kernels.
> > >
> > > The downside is that this simple method does not allow checking for
> > > allowed errors per specific event type, so if we decide that we do want
> > > to allow returning FAN_DENY_ERRNO for FAN_OPEN_PERM later on, this method
> > > could not be used by userspace to test for this finer grained support.
> >
> > True, in that case the HSM manager would have to try responding with
> > FAN_DENY_ERRNO() and if it fails, it will have to fallback to responding
> > with FAN_DENY. Not too bad I'd say.
> >
>
> Yeah that works too.
>
> > > In another thread, I mention the fact that FAN_OPEN_PERM still has a
> > > potential freeze deadlock when called from open(O_TRUNC|O_CREATE),
> > > so we can consider the fact that FAN_DENY_ERRNO is not allowed with
> > > FAN_OPEN_PERM as a negative incentive for people to consider using
> > > FAN_OPEN_PERM as a trigger for HSM.
> >
> > AFAIU from the past discussions, there's no good use of FAN_OPEN_PERM
> > event for HSM. If that's the case, I'm for not allowing FAN_DENY_ERRNO for
> > FAN_OPEN_PERM.
>
> In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark
> to deny open of file during the short time that it's content is being
> punched out [1].
> It is quite complicated to explain, but I only used it for denying access,
> not to fill content and not to write anything to filesystem.
> It's worth noting that returning EBUSY in that case would be more meaningful
> to users.
>
> That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM,
> but mainly I do not have a proof that people will not need it.
>
> OTOH, I am a bit concerned that this will encourage developer to use
> FAN_OPEN_PERM as a trigger to filling file content and then we are back to
> deadlock risk zone.
>
> Not sure which way to go.
>
> Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO
> before FAN_PRE_* events, so we can continue this discussion later when
> I post FAN_PRE_* patches - not for this cycle.
>

Hi Jan,

I started to prepare the pre-content events patches for posting and got back
to this one as well.

Since we had this discussion I have learned of another use case that requires
filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM to
be exact.

The reason is that unless an executable content is filled at execve() time,
there is no other opportunity to fill its content without getting -ETXTBSY.

So to keep things more flexible, I decided to add -ETXTBSY to the
allowed errors with FAN_DENY_ERRNO() and to decided to allow
FAN_DENY_ERRNO() with all permission events.

To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have
added a limitation that FAN_DENY_ERRNO() is allowed only for
FAN_CLASS_PRE_CONTENT groups.

Thanks,
Amir.
Jan Kara Feb. 5, 2024, 6:27 p.m. UTC | #8
I'm sorry for the delay. The last week was busy and this fell through the
cracks.

On Mon 29-01-24 20:30:34, Amir Goldstein wrote:
> On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark
> > to deny open of file during the short time that it's content is being
> > punched out [1].
> > It is quite complicated to explain, but I only used it for denying access,
> > not to fill content and not to write anything to filesystem.
> > It's worth noting that returning EBUSY in that case would be more meaningful
> > to users.
> >
> > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM,
> > but mainly I do not have a proof that people will not need it.
> >
> > OTOH, I am a bit concerned that this will encourage developer to use
> > FAN_OPEN_PERM as a trigger to filling file content and then we are back to
> > deadlock risk zone.
> >
> > Not sure which way to go.
> >
> > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO
> > before FAN_PRE_* events, so we can continue this discussion later when
> > I post FAN_PRE_* patches - not for this cycle.
> 
> I started to prepare the pre-content events patches for posting and got back
> to this one as well.
> 
> Since we had this discussion I have learned of another use case that
> requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM
> to be exact.
>
> The reason is that unless an executable content is filled at execve() time,
> there is no other opportunity to fill its content without getting -ETXTBSY.

Yes, I've been scratching my head over this usecase for a few days. I was
thinking whether we could somehow fill in executable (and executed) files on
access but it all seemed too hacky so I agree that we probably have to fill
them in on open. 

> So to keep things more flexible, I decided to add -ETXTBSY to the
> allowed errors with FAN_DENY_ERRNO() and to decided to allow
> FAN_DENY_ERRNO() with all permission events.
>
> To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have
> added a limitation that FAN_DENY_ERRNO() is allowed only for
> FAN_CLASS_PRE_CONTENT groups.

I have no problem with adding -ETXTBSY to the set of allowed errors. That
makes sense. Adding FAN_DENY_ERRNO() to all permission events in
FAN_CLASS_PRE_CONTENT groups - OK, if we don't find anything better - I
wanted to hash out another possibility here: Currently all permission
events (and thus also the events we plan to use for HSM AFAIU) are using
'fd' to identify file where the event happened. This is used as identifier
for response, can be used to fill in file contents for HSM but it also
causes issues such as the problem with exec(2) occasionally failing if this
fd happens to get closed only after exec(2) gets to checking
deny_write_access(). So what if we implemented events needed for HSM as FID
events (we'd have think how to match replies to events)? Then the app would
open the file for filling in using FID as well as it would naturally close
the handle before replying so problems with exec(2) would not arise. These
would be essentially new events (so far we didn't allow permission events
in FID groups) so allowing FAN_DENY_ERRNO() replies for them would be
natural. Overall it would seem like a cleaner "clean room implementation"
API?

								Honza
Amir Goldstein Feb. 6, 2024, 4:35 p.m. UTC | #9
On Mon, Feb 5, 2024 at 8:27 PM Jan Kara <jack@suse.cz> wrote:
>
> I'm sorry for the delay. The last week was busy and this fell through the
> cracks.
>

No worries, I was busy as well.
I did already rebase fan_pre_content & fan_errno [1] (over v6.8-rc2)
last week, made the small changes I mention here and ran some
basic tests, but did not complete writing tests.
Hoping to switch back to it this week.

[1] https://github.com/amir73il/linux/commits/fan_errno

> On Mon 29-01-24 20:30:34, Amir Goldstein wrote:
> > On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark
> > > to deny open of file during the short time that it's content is being
> > > punched out [1].
> > > It is quite complicated to explain, but I only used it for denying access,
> > > not to fill content and not to write anything to filesystem.
> > > It's worth noting that returning EBUSY in that case would be more meaningful
> > > to users.
> > >
> > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM,
> > > but mainly I do not have a proof that people will not need it.
> > >
> > > OTOH, I am a bit concerned that this will encourage developer to use
> > > FAN_OPEN_PERM as a trigger to filling file content and then we are back to
> > > deadlock risk zone.
> > >
> > > Not sure which way to go.
> > >
> > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO
> > > before FAN_PRE_* events, so we can continue this discussion later when
> > > I post FAN_PRE_* patches - not for this cycle.
> >
> > I started to prepare the pre-content events patches for posting and got back
> > to this one as well.
> >
> > Since we had this discussion I have learned of another use case that
> > requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM
> > to be exact.
> >
> > The reason is that unless an executable content is filled at execve() time,
> > there is no other opportunity to fill its content without getting -ETXTBSY.
>
> Yes, I've been scratching my head over this usecase for a few days. I was
> thinking whether we could somehow fill in executable (and executed) files on
> access but it all seemed too hacky so I agree that we probably have to fill
> them in on open.
>

Normally, I think there will not be a really huge executable(?)
If there were huge executables, they would have likely been broken down
into smaller loadable libraries which should allow more granular
content filling,
but I guess there will always be worst case exceptions.

> > So to keep things more flexible, I decided to add -ETXTBSY to the
> > allowed errors with FAN_DENY_ERRNO() and to decided to allow
> > FAN_DENY_ERRNO() with all permission events.
> >
> > To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have
> > added a limitation that FAN_DENY_ERRNO() is allowed only for
> > FAN_CLASS_PRE_CONTENT groups.
>
> I have no problem with adding -ETXTBSY to the set of allowed errors. That
> makes sense. Adding FAN_DENY_ERRNO() to all permission events in
> FAN_CLASS_PRE_CONTENT groups - OK,

done that.

I am still not very happy about FAN_OPEN_PERM being part of HSM
event family when I know that O_TRUCT and O_CREAT call this hook
with sb writers held.

The irony, is that there is no chance that O_TRUNC will require filling
content, same if the file is actually being created by O_CREAT, so the
cases where sb writers is actually needed and the case where content
filling is needed do not overlap, but I cannot figure out how to get those
cases out of the HSM risk zone. Ideas?

> if we don't find anything better - I
> wanted to hash out another possibility here: Currently all permission
> events (and thus also the events we plan to use for HSM AFAIU) are using
> 'fd' to identify file where the event happened. This is used as identifier
> for response, can be used to fill in file contents for HSM but it also
> causes issues such as the problem with exec(2) occasionally failing if this
> fd happens to get closed only after exec(2) gets to checking
> deny_write_access(). So what if we implemented events needed for HSM as FID
> events (we'd have think how to match replies to events)? Then the app would
> open the file for filling in using FID as well as it would naturally close
> the handle before replying so problems with exec(2) would not arise. These

The two things are independent IMO.
We can use an event->key instead of event->fd, which I like,
but we may still leave event->fd as a way to get an FMODE_NONOTIFY
fd as long as the user closes event->fd before responding or we can
implement Sargun's suggestion of the FAN_CLOSE_FD response flag.

If a user needs to open an FMODE_NONOTIFY fd from fid, we will
need to provide a way to do that.
My WIP pre-lookup event patches [2] implements inheritance of
FMODE_NONOTIFY from dirfd used for openat().
Perhaps we can do the same for open_by_handle_at() and inherit
FMODE_NONOTIFY from mount_fd to implement your suggestion?

[2] https://github.com/amir73il/linux/commits/fan_lookup_perm

> would be essentially new events (so far we didn't allow permission events
> in FID groups) so allowing FAN_DENY_ERRNO() replies for them would be
> natural. Overall it would seem like a cleaner "clean room implementation"
> API?

I like the idea of a clean slate.

Looking a head, for the PRE_PATH events (e.g. lookup,create)
I was planning to use FAN_EVENT_INFO_TYPE_DFID_NAME
to carry the last component lookup name, but then also have
event->fd as the dirfd of lookup/create.

That's a bit ugly duplicity and also it does not cover rename(), because
if we use FAN_EVENT_INFO_TYPE_{OLD,NEW}_DFID_NAME
to report names, where would newdirfd, olddirfd be reported?

Your suggestion solves both these questions elegantly and
if you agree to adapting open_by_handle_at() to cater fanotify needs,
then we have a plan to propose.

The bad side of clean slate is that it reduces the chances of me
being able to get pre-content events ready in time for 6.9, which
is a shame, but we got to do what we got to do ;)

Thanks,
Amir.
Amir Goldstein Feb. 8, 2024, 2:04 p.m. UTC | #10
> > On Mon 29-01-24 20:30:34, Amir Goldstein wrote:
> > > On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark
> > > > to deny open of file during the short time that it's content is being
> > > > punched out.
> > > > It is quite complicated to explain, but I only used it for denying access,
> > > > not to fill content and not to write anything to filesystem.
> > > > It's worth noting that returning EBUSY in that case would be more meaningful
> > > > to users.
> > > >
> > > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM,
> > > > but mainly I do not have a proof that people will not need it.
> > > >
> > > > OTOH, I am a bit concerned that this will encourage developer to use
> > > > FAN_OPEN_PERM as a trigger to filling file content and then we are back to
> > > > deadlock risk zone.
> > > >
> > > > Not sure which way to go.
> > > >
> > > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO
> > > > before FAN_PRE_* events, so we can continue this discussion later when
> > > > I post FAN_PRE_* patches - not for this cycle.
> > >
> > > I started to prepare the pre-content events patches for posting and got back
> > > to this one as well.
> > >
> > > Since we had this discussion I have learned of another use case that
> > > requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM
> > > to be exact.
> > >
> > > The reason is that unless an executable content is filled at execve() time,
> > > there is no other opportunity to fill its content without getting -ETXTBSY.
> >
> > Yes, I've been scratching my head over this usecase for a few days. I was
> > thinking whether we could somehow fill in executable (and executed) files on
> > access but it all seemed too hacky so I agree that we probably have to fill
> > them in on open.
> >
>
> Normally, I think there will not be a really huge executable(?)
> If there were huge executables, they would have likely been broken down
> into smaller loadable libraries which should allow more granular
> content filling,
> but I guess there will always be worst case exceptions.
>
> > > So to keep things more flexible, I decided to add -ETXTBSY to the
> > > allowed errors with FAN_DENY_ERRNO() and to decided to allow
> > > FAN_DENY_ERRNO() with all permission events.
> > >
> > > To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have
> > > added a limitation that FAN_DENY_ERRNO() is allowed only for
> > > FAN_CLASS_PRE_CONTENT groups.
> >
> > I have no problem with adding -ETXTBSY to the set of allowed errors. That
> > makes sense. Adding FAN_DENY_ERRNO() to all permission events in
> > FAN_CLASS_PRE_CONTENT groups - OK,
>
> done that.
>
> I am still not very happy about FAN_OPEN_PERM being part of HSM
> event family when I know that O_TRUCT and O_CREAT call this hook
> with sb writers held.
>
> The irony, is that there is no chance that O_TRUNC will require filling
> content, same if the file is actually being created by O_CREAT, so the
> cases where sb writers is actually needed and the case where content
> filling is needed do not overlap, but I cannot figure out how to get those
> cases out of the HSM risk zone. Ideas?
>

Jan,

I wanted to run an idea by you.

I like your idea to start a clean slate with
FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID
and it would be nice if we could restrict this HSM to use
pre-content events, which is why I was not happy about allowing
FAN_DENY_ERRNO() for the legacy FAN_OPEN*_PERM events,
especially with the known deadlocks.

Since we already know that we need to generate
FAN_PRE_ACCESS(offset,length) for read-only mmap() and
FAN_PRE_MODIFY(offset,length) for writable mmap(),
we could treat uselib() and execve() the same way and generate
FAN_PRE_ACCESS(0,i_size) as if the file was mmaped.

My idea is to generate an event FAN_PRE_MODIFY(0,0)
for an open for write *after* file was truncated and
FAN_PRE_ACCESS(0,0) for open O_RDONLY.
Possibly also FAN_PRE_*(offset,0) events for llseek().

I've already pushed a POC to fan_pre_content branch [1].
Only sanity tested that nothing else is broken.
I still need to add the mmap hooks and test the new events.

With this, HSM will have appropriate hooks to fill executable
and library on first access and also fill arbitrary files on open
including the knowledge if the file was opened for write.

Thoughts?

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fan_pre_content
Jan Kara Feb. 8, 2024, 6:31 p.m. UTC | #11
On Thu 08-02-24 16:04:29, Amir Goldstein wrote:
> > > On Mon 29-01-24 20:30:34, Amir Goldstein wrote:
> > > > On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark
> > > > > to deny open of file during the short time that it's content is being
> > > > > punched out.
> > > > > It is quite complicated to explain, but I only used it for denying access,
> > > > > not to fill content and not to write anything to filesystem.
> > > > > It's worth noting that returning EBUSY in that case would be more meaningful
> > > > > to users.
> > > > >
> > > > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM,
> > > > > but mainly I do not have a proof that people will not need it.
> > > > >
> > > > > OTOH, I am a bit concerned that this will encourage developer to use
> > > > > FAN_OPEN_PERM as a trigger to filling file content and then we are back to
> > > > > deadlock risk zone.
> > > > >
> > > > > Not sure which way to go.
> > > > >
> > > > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO
> > > > > before FAN_PRE_* events, so we can continue this discussion later when
> > > > > I post FAN_PRE_* patches - not for this cycle.
> > > >
> > > > I started to prepare the pre-content events patches for posting and got back
> > > > to this one as well.
> > > >
> > > > Since we had this discussion I have learned of another use case that
> > > > requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM
> > > > to be exact.
> > > >
> > > > The reason is that unless an executable content is filled at execve() time,
> > > > there is no other opportunity to fill its content without getting -ETXTBSY.
> > >
> > > Yes, I've been scratching my head over this usecase for a few days. I was
> > > thinking whether we could somehow fill in executable (and executed) files on
> > > access but it all seemed too hacky so I agree that we probably have to fill
> > > them in on open.
> > >
> >
> > Normally, I think there will not be a really huge executable(?)
> > If there were huge executables, they would have likely been broken down
> > into smaller loadable libraries which should allow more granular
> > content filling,
> > but I guess there will always be worst case exceptions.
> >
> > > > So to keep things more flexible, I decided to add -ETXTBSY to the
> > > > allowed errors with FAN_DENY_ERRNO() and to decided to allow
> > > > FAN_DENY_ERRNO() with all permission events.
> > > >
> > > > To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have
> > > > added a limitation that FAN_DENY_ERRNO() is allowed only for
> > > > FAN_CLASS_PRE_CONTENT groups.
> > >
> > > I have no problem with adding -ETXTBSY to the set of allowed errors. That
> > > makes sense. Adding FAN_DENY_ERRNO() to all permission events in
> > > FAN_CLASS_PRE_CONTENT groups - OK,
> >
> > done that.
> >
> > I am still not very happy about FAN_OPEN_PERM being part of HSM
> > event family when I know that O_TRUCT and O_CREAT call this hook
> > with sb writers held.
> >
> > The irony, is that there is no chance that O_TRUNC will require filling
> > content, same if the file is actually being created by O_CREAT, so the
> > cases where sb writers is actually needed and the case where content
> > filling is needed do not overlap, but I cannot figure out how to get those
> > cases out of the HSM risk zone. Ideas?
> >
> 
> Jan,
> 
> I wanted to run an idea by you.
> 
> I like your idea to start a clean slate with
> FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID
> and it would be nice if we could restrict this HSM to use
> pre-content events, which is why I was not happy about allowing
> FAN_DENY_ERRNO() for the legacy FAN_OPEN*_PERM events,
> especially with the known deadlocks.
> 
> Since we already know that we need to generate
> FAN_PRE_ACCESS(offset,length) for read-only mmap() and
> FAN_PRE_MODIFY(offset,length) for writable mmap(),
> we could treat uselib() and execve() the same way and generate
> FAN_PRE_ACCESS(0,i_size) as if the file was mmaped.

BTW uselib() is deprecated and there is a patch queued to not generate
OPEN_EXEC events for it because it was causing problems (not the generation
of events itself but the FMODE_EXEC bit being set in uselib). So I don't
think we need to instrument uselib().

> My idea is to generate an event FAN_PRE_MODIFY(0,0)
> for an open for write *after* file was truncated and
> FAN_PRE_ACCESS(0,0) for open O_RDONLY.

What I find somewhat strange about this is that if we return error from the
fsnotify_file_perm() hook, open(2) will fail with error but the file is
already truncated. But I guess it should be rare and it's bearable.

> Possibly also FAN_PRE_*(offset,0) events for llseek().

That seem overdoing it a bit IMO :)

> I've already pushed a POC to fan_pre_content branch [1].
> Only sanity tested that nothing else is broken.
> I still need to add the mmap hooks and test the new events.
> 
> With this, HSM will have appropriate hooks to fill executable
> and library on first access and also fill arbitrary files on open
> including the knowledge if the file was opened for write.
> 
> Thoughts?

Yeah, I guess this looks sensible.

								Honza
Amir Goldstein Feb. 8, 2024, 7:21 p.m. UTC | #12
On Thu, Feb 8, 2024 at 8:31 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 08-02-24 16:04:29, Amir Goldstein wrote:
> > > > On Mon 29-01-24 20:30:34, Amir Goldstein wrote:
> > > > > On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark
> > > > > > to deny open of file during the short time that it's content is being
> > > > > > punched out.
> > > > > > It is quite complicated to explain, but I only used it for denying access,
> > > > > > not to fill content and not to write anything to filesystem.
> > > > > > It's worth noting that returning EBUSY in that case would be more meaningful
> > > > > > to users.
> > > > > >
> > > > > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM,
> > > > > > but mainly I do not have a proof that people will not need it.
> > > > > >
> > > > > > OTOH, I am a bit concerned that this will encourage developer to use
> > > > > > FAN_OPEN_PERM as a trigger to filling file content and then we are back to
> > > > > > deadlock risk zone.
> > > > > >
> > > > > > Not sure which way to go.
> > > > > >
> > > > > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO
> > > > > > before FAN_PRE_* events, so we can continue this discussion later when
> > > > > > I post FAN_PRE_* patches - not for this cycle.
> > > > >
> > > > > I started to prepare the pre-content events patches for posting and got back
> > > > > to this one as well.
> > > > >
> > > > > Since we had this discussion I have learned of another use case that
> > > > > requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM
> > > > > to be exact.
> > > > >
> > > > > The reason is that unless an executable content is filled at execve() time,
> > > > > there is no other opportunity to fill its content without getting -ETXTBSY.
> > > >
> > > > Yes, I've been scratching my head over this usecase for a few days. I was
> > > > thinking whether we could somehow fill in executable (and executed) files on
> > > > access but it all seemed too hacky so I agree that we probably have to fill
> > > > them in on open.
> > > >
> > >
> > > Normally, I think there will not be a really huge executable(?)
> > > If there were huge executables, they would have likely been broken down
> > > into smaller loadable libraries which should allow more granular
> > > content filling,
> > > but I guess there will always be worst case exceptions.
> > >
> > > > > So to keep things more flexible, I decided to add -ETXTBSY to the
> > > > > allowed errors with FAN_DENY_ERRNO() and to decided to allow
> > > > > FAN_DENY_ERRNO() with all permission events.
> > > > >
> > > > > To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have
> > > > > added a limitation that FAN_DENY_ERRNO() is allowed only for
> > > > > FAN_CLASS_PRE_CONTENT groups.
> > > >
> > > > I have no problem with adding -ETXTBSY to the set of allowed errors. That
> > > > makes sense. Adding FAN_DENY_ERRNO() to all permission events in
> > > > FAN_CLASS_PRE_CONTENT groups - OK,
> > >
> > > done that.
> > >
> > > I am still not very happy about FAN_OPEN_PERM being part of HSM
> > > event family when I know that O_TRUCT and O_CREAT call this hook
> > > with sb writers held.
> > >
> > > The irony, is that there is no chance that O_TRUNC will require filling
> > > content, same if the file is actually being created by O_CREAT, so the
> > > cases where sb writers is actually needed and the case where content
> > > filling is needed do not overlap, but I cannot figure out how to get those
> > > cases out of the HSM risk zone. Ideas?
> > >
> >
> > Jan,
> >
> > I wanted to run an idea by you.
> >
> > I like your idea to start a clean slate with
> > FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID
> > and it would be nice if we could restrict this HSM to use
> > pre-content events, which is why I was not happy about allowing
> > FAN_DENY_ERRNO() for the legacy FAN_OPEN*_PERM events,
> > especially with the known deadlocks.
> >
> > Since we already know that we need to generate
> > FAN_PRE_ACCESS(offset,length) for read-only mmap() and
> > FAN_PRE_MODIFY(offset,length) for writable mmap(),
> > we could treat uselib() and execve() the same way and generate
> > FAN_PRE_ACCESS(0,i_size) as if the file was mmaped.
>
> BTW uselib() is deprecated and there is a patch queued to not generate
> OPEN_EXEC events for it because it was causing problems (not the generation
> of events itself but the FMODE_EXEC bit being set in uselib). So I don't
> think we need to instrument uselib().
>

Great. The fewer the better :)

BTW, for mmap, I was thinking of adding fsnotify_file_perm() next to
call sites of security_mmap_file(), but I see that:
1. shmat() has security_mmap_file() - is it relevant?
2. remap_file_pages() calls do_mmap() without security_mmap_file() -
    do we need to cover it?

> > My idea is to generate an event FAN_PRE_MODIFY(0,0)
> > for an open for write *after* file was truncated and
> > FAN_PRE_ACCESS(0,0) for open O_RDONLY.
>
> What I find somewhat strange about this is that if we return error from the
> fsnotify_file_perm() hook, open(2) will fail with error but the file is
> already truncated. But I guess it should be rare and it's bearable.
>
> > Possibly also FAN_PRE_*(offset,0) events for llseek().
>
> That seem overdoing it a bit IMO :)

Heh! forget I said it ;)

>
> > I've already pushed a POC to fan_pre_content branch [1].
> > Only sanity tested that nothing else is broken.
> > I still need to add the mmap hooks and test the new events.
> >
> > With this, HSM will have appropriate hooks to fill executable
> > and library on first access and also fill arbitrary files on open
> > including the knowledge if the file was opened for write.
> >
> > Thoughts?
>
> Yeah, I guess this looks sensible.
>

Cool, so let's see, what is left to do for the plan of
FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID?

1. event->fd is O_PATH mount_fd for open_by_handle_at()
2. open_by_handle_at() inherits FMODE_NONOTIFY from mount_fd
3. either implement the FAN_CLOSE_FD response flag (easy?) and/or
    implement FAN_REPORT_EVENT_ID and new header format

Anything else?
Are you ok with 1 and 2?
Do you have a preference for 3?

Thanks,
Amir.
Jan Kara Feb. 12, 2024, 12:01 p.m. UTC | #13
On Thu 08-02-24 21:21:13, Amir Goldstein wrote:
> On Thu, Feb 8, 2024 at 8:31 PM Jan Kara <jack@suse.cz> wrote:
> > On Thu 08-02-24 16:04:29, Amir Goldstein wrote:
> > > > > On Mon 29-01-24 20:30:34, Amir Goldstein wrote:
> > > > > > On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark
> > > > > > > to deny open of file during the short time that it's content is being
> > > > > > > punched out.
> > > > > > > It is quite complicated to explain, but I only used it for denying access,
> > > > > > > not to fill content and not to write anything to filesystem.
> > > > > > > It's worth noting that returning EBUSY in that case would be more meaningful
> > > > > > > to users.
> > > > > > >
> > > > > > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM,
> > > > > > > but mainly I do not have a proof that people will not need it.
> > > > > > >
> > > > > > > OTOH, I am a bit concerned that this will encourage developer to use
> > > > > > > FAN_OPEN_PERM as a trigger to filling file content and then we are back to
> > > > > > > deadlock risk zone.
> > > > > > >
> > > > > > > Not sure which way to go.
> > > > > > >
> > > > > > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO
> > > > > > > before FAN_PRE_* events, so we can continue this discussion later when
> > > > > > > I post FAN_PRE_* patches - not for this cycle.
> > > > > >
> > > > > > I started to prepare the pre-content events patches for posting and got back
> > > > > > to this one as well.
> > > > > >
> > > > > > Since we had this discussion I have learned of another use case that
> > > > > > requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM
> > > > > > to be exact.
> > > > > >
> > > > > > The reason is that unless an executable content is filled at execve() time,
> > > > > > there is no other opportunity to fill its content without getting -ETXTBSY.
> > > > >
> > > > > Yes, I've been scratching my head over this usecase for a few days. I was
> > > > > thinking whether we could somehow fill in executable (and executed) files on
> > > > > access but it all seemed too hacky so I agree that we probably have to fill
> > > > > them in on open.
> > > > >
> > > >
> > > > Normally, I think there will not be a really huge executable(?)
> > > > If there were huge executables, they would have likely been broken down
> > > > into smaller loadable libraries which should allow more granular
> > > > content filling,
> > > > but I guess there will always be worst case exceptions.
> > > >
> > > > > > So to keep things more flexible, I decided to add -ETXTBSY to the
> > > > > > allowed errors with FAN_DENY_ERRNO() and to decided to allow
> > > > > > FAN_DENY_ERRNO() with all permission events.
> > > > > >
> > > > > > To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have
> > > > > > added a limitation that FAN_DENY_ERRNO() is allowed only for
> > > > > > FAN_CLASS_PRE_CONTENT groups.
> > > > >
> > > > > I have no problem with adding -ETXTBSY to the set of allowed errors. That
> > > > > makes sense. Adding FAN_DENY_ERRNO() to all permission events in
> > > > > FAN_CLASS_PRE_CONTENT groups - OK,
> > > >
> > > > done that.
> > > >
> > > > I am still not very happy about FAN_OPEN_PERM being part of HSM
> > > > event family when I know that O_TRUCT and O_CREAT call this hook
> > > > with sb writers held.
> > > >
> > > > The irony, is that there is no chance that O_TRUNC will require filling
> > > > content, same if the file is actually being created by O_CREAT, so the
> > > > cases where sb writers is actually needed and the case where content
> > > > filling is needed do not overlap, but I cannot figure out how to get those
> > > > cases out of the HSM risk zone. Ideas?
> > > >
> > >
> > > Jan,
> > >
> > > I wanted to run an idea by you.
> > >
> > > I like your idea to start a clean slate with
> > > FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID
> > > and it would be nice if we could restrict this HSM to use
> > > pre-content events, which is why I was not happy about allowing
> > > FAN_DENY_ERRNO() for the legacy FAN_OPEN*_PERM events,
> > > especially with the known deadlocks.
> > >
> > > Since we already know that we need to generate
> > > FAN_PRE_ACCESS(offset,length) for read-only mmap() and
> > > FAN_PRE_MODIFY(offset,length) for writable mmap(),
> > > we could treat uselib() and execve() the same way and generate
> > > FAN_PRE_ACCESS(0,i_size) as if the file was mmaped.
> >
> > BTW uselib() is deprecated and there is a patch queued to not generate
> > OPEN_EXEC events for it because it was causing problems (not the generation
> > of events itself but the FMODE_EXEC bit being set in uselib). So I don't
> > think we need to instrument uselib().
> >
> 
> Great. The fewer the better :)
> 
> BTW, for mmap, I was thinking of adding fsnotify_file_perm() next to
> call sites of security_mmap_file(), but I see that:
> 1. shmat() has security_mmap_file() - is it relevant?

Well, this is effectively mmap(2) of a tmpfs file. So I don't think this is
particularly useful for HSM purposes but we should probably have it for
consistency?

> 2. remap_file_pages() calls do_mmap() without security_mmap_file() -
>     do we need to cover it?

Hmm, AFAIU remap_file_pages() just allows you to mess with an existing
mapping so it isn't very interesting from HSM POV?

> > > I've already pushed a POC to fan_pre_content branch [1].
> > > Only sanity tested that nothing else is broken.
> > > I still need to add the mmap hooks and test the new events.
> > >
> > > With this, HSM will have appropriate hooks to fill executable
> > > and library on first access and also fill arbitrary files on open
> > > including the knowledge if the file was opened for write.
> > >
> > > Thoughts?
> >
> > Yeah, I guess this looks sensible.
> 
> Cool, so let's see, what is left to do for the plan of
> FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID?
> 
> 1. event->fd is O_PATH mount_fd for open_by_handle_at()
> 2. open_by_handle_at() inherits FMODE_NONOTIFY from mount_fd
> 3. either implement the FAN_CLOSE_FD response flag (easy?) and/or
>     implement FAN_REPORT_EVENT_ID and new header format
> 
> Anything else?
> Are you ok with 1 and 2?

I'm not sure about 1) and 2) so I'm mostly thinking out loud now. AFAIU you
want to provide mount_fd only because of FMODE_NONOTIFY inheritance so 2)
is the key question. But if you provide mount_fd with FMODE_NONOTIFY and
have FMODE_NONOTIFY inheritance, then what's the difference to just allow
opens with FMODE_NONOTIFY from the start? I don't think restricting
FMODE_NONOTIFY to inheritance gives any additional strong security
guarantees?

I understand so far we didn't expose FMODE_NONOTIFY so that people cannot
bypass fanotify permission events. But now we need a sensible way to fill
in the filesystem without deadlocking on our own watches. Maybe exposing
FMODE_NONOTIFY as an open flag is too attractive for misuse so could be
somehow tie it to the HSM app? We were already discussing that we need to
somehow register the HSM app with the filesystem to avoid issues when the
app crashes or so. So maybe we could tie this registration to the ability
of bypassing generation of notification?

> Do you have a preference for 3?

Yeah. FAN_CLOSE_FD seems as a hack to me (and you're likely to get it wrong
until you've spent quite some time debugging with exec sometimes fails with
ETXTBUSY). So I prefer FAN_REPORT_EVENT_ID and use event id as an
identifier when replying to HSM events.

								Honza
Amir Goldstein Feb. 12, 2024, 2:56 p.m. UTC | #14
On Mon, Feb 12, 2024 at 2:01 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 08-02-24 21:21:13, Amir Goldstein wrote:
> > On Thu, Feb 8, 2024 at 8:31 PM Jan Kara <jack@suse.cz> wrote:
> > > On Thu 08-02-24 16:04:29, Amir Goldstein wrote:
> > > > > > On Mon 29-01-24 20:30:34, Amir Goldstein wrote:
> > > > > > > On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark
> > > > > > > > to deny open of file during the short time that it's content is being
> > > > > > > > punched out.
> > > > > > > > It is quite complicated to explain, but I only used it for denying access,
> > > > > > > > not to fill content and not to write anything to filesystem.
> > > > > > > > It's worth noting that returning EBUSY in that case would be more meaningful
> > > > > > > > to users.
> > > > > > > >
> > > > > > > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM,
> > > > > > > > but mainly I do not have a proof that people will not need it.
> > > > > > > >
> > > > > > > > OTOH, I am a bit concerned that this will encourage developer to use
> > > > > > > > FAN_OPEN_PERM as a trigger to filling file content and then we are back to
> > > > > > > > deadlock risk zone.
> > > > > > > >
> > > > > > > > Not sure which way to go.
> > > > > > > >
> > > > > > > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO
> > > > > > > > before FAN_PRE_* events, so we can continue this discussion later when
> > > > > > > > I post FAN_PRE_* patches - not for this cycle.
> > > > > > >
> > > > > > > I started to prepare the pre-content events patches for posting and got back
> > > > > > > to this one as well.
> > > > > > >
> > > > > > > Since we had this discussion I have learned of another use case that
> > > > > > > requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM
> > > > > > > to be exact.
> > > > > > >
> > > > > > > The reason is that unless an executable content is filled at execve() time,
> > > > > > > there is no other opportunity to fill its content without getting -ETXTBSY.
> > > > > >
> > > > > > Yes, I've been scratching my head over this usecase for a few days. I was
> > > > > > thinking whether we could somehow fill in executable (and executed) files on
> > > > > > access but it all seemed too hacky so I agree that we probably have to fill
> > > > > > them in on open.
> > > > > >
> > > > >
> > > > > Normally, I think there will not be a really huge executable(?)
> > > > > If there were huge executables, they would have likely been broken down
> > > > > into smaller loadable libraries which should allow more granular
> > > > > content filling,
> > > > > but I guess there will always be worst case exceptions.
> > > > >
> > > > > > > So to keep things more flexible, I decided to add -ETXTBSY to the
> > > > > > > allowed errors with FAN_DENY_ERRNO() and to decided to allow
> > > > > > > FAN_DENY_ERRNO() with all permission events.
> > > > > > >
> > > > > > > To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have
> > > > > > > added a limitation that FAN_DENY_ERRNO() is allowed only for
> > > > > > > FAN_CLASS_PRE_CONTENT groups.
> > > > > >
> > > > > > I have no problem with adding -ETXTBSY to the set of allowed errors. That
> > > > > > makes sense. Adding FAN_DENY_ERRNO() to all permission events in
> > > > > > FAN_CLASS_PRE_CONTENT groups - OK,
> > > > >
> > > > > done that.
> > > > >
> > > > > I am still not very happy about FAN_OPEN_PERM being part of HSM
> > > > > event family when I know that O_TRUCT and O_CREAT call this hook
> > > > > with sb writers held.
> > > > >
> > > > > The irony, is that there is no chance that O_TRUNC will require filling
> > > > > content, same if the file is actually being created by O_CREAT, so the
> > > > > cases where sb writers is actually needed and the case where content
> > > > > filling is needed do not overlap, but I cannot figure out how to get those
> > > > > cases out of the HSM risk zone. Ideas?
> > > > >
> > > >
> > > > Jan,
> > > >
> > > > I wanted to run an idea by you.
> > > >
> > > > I like your idea to start a clean slate with
> > > > FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID
> > > > and it would be nice if we could restrict this HSM to use
> > > > pre-content events, which is why I was not happy about allowing
> > > > FAN_DENY_ERRNO() for the legacy FAN_OPEN*_PERM events,
> > > > especially with the known deadlocks.
> > > >
> > > > Since we already know that we need to generate
> > > > FAN_PRE_ACCESS(offset,length) for read-only mmap() and
> > > > FAN_PRE_MODIFY(offset,length) for writable mmap(),
> > > > we could treat uselib() and execve() the same way and generate
> > > > FAN_PRE_ACCESS(0,i_size) as if the file was mmaped.
> > >
> > > BTW uselib() is deprecated and there is a patch queued to not generate
> > > OPEN_EXEC events for it because it was causing problems (not the generation
> > > of events itself but the FMODE_EXEC bit being set in uselib). So I don't
> > > think we need to instrument uselib().
> > >
> >
> > Great. The fewer the better :)
> >
> > BTW, for mmap, I was thinking of adding fsnotify_file_perm() next to
> > call sites of security_mmap_file(), but I see that:
> > 1. shmat() has security_mmap_file() - is it relevant?
>
> Well, this is effectively mmap(2) of a tmpfs file. So I don't think this is
> particularly useful for HSM purposes but we should probably have it for
> consistency?

OK.

Actually, I think there is a use case for tmpfs + HSM. HSM is about
lazy populating of a local filesystem from a remote slower tier.
The fast local tier could be persistent, but could just as well be tmpfs,
for a temporary lazy local mirror.

This use case reminds overlayfs a bit, where the local tier is the upper fs,
but in case of HSM, the "copy up" happens also on read.

I do want to limit HSM to "fully featured local fs", so we do not end up
with crazy fs doing HSM. The very minimum should be decodable
file handle support and unique fsid, but we could also think of more strict
requirements(?). Anyway, tmpfs should meet all the "fully featured local fs"
requirement.

>
> > 2. remap_file_pages() calls do_mmap() without security_mmap_file() -
> >     do we need to cover it?
>
> Hmm, AFAIU remap_file_pages() just allows you to mess with an existing
> mapping so it isn't very interesting from HSM POV?

Nope.

>
> > > > I've already pushed a POC to fan_pre_content branch [1].
> > > > Only sanity tested that nothing else is broken.
> > > > I still need to add the mmap hooks and test the new events.
> > > >
> > > > With this, HSM will have appropriate hooks to fill executable
> > > > and library on first access and also fill arbitrary files on open
> > > > including the knowledge if the file was opened for write.
> > > >
> > > > Thoughts?
> > >
> > > Yeah, I guess this looks sensible.
> >
> > Cool, so let's see, what is left to do for the plan of
> > FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID?
> >
> > 1. event->fd is O_PATH mount_fd for open_by_handle_at()
> > 2. open_by_handle_at() inherits FMODE_NONOTIFY from mount_fd
> > 3. either implement the FAN_CLOSE_FD response flag (easy?) and/or
> >     implement FAN_REPORT_EVENT_ID and new header format
> >
> > Anything else?
> > Are you ok with 1 and 2?
>
> I'm not sure about 1) and 2) so I'm mostly thinking out loud now. AFAIU you
> want to provide mount_fd only because of FMODE_NONOTIFY inheritance so 2)
> is the key question. But if you provide mount_fd with FMODE_NONOTIFY and
> have FMODE_NONOTIFY inheritance, then what's the difference to just allow
> opens with FMODE_NONOTIFY from the start? I don't think restricting
> FMODE_NONOTIFY to inheritance gives any additional strong security
> guarantees?
>

You are right.
Just need to think of the right API.
It is just very convenient to be getting the FMODE_NONOTIFY from
fanotify, but let's think about it.

> I understand so far we didn't expose FMODE_NONOTIFY so that people cannot
> bypass fanotify permission events. But now we need a sensible way to fill
> in the filesystem without deadlocking on our own watches. Maybe exposing
> FMODE_NONOTIFY as an open flag is too attractive for misuse so could be
> somehow tie it to the HSM app? We were already discussing that we need to
> somehow register the HSM app with the filesystem to avoid issues when the
> app crashes or so. So maybe we could tie this registration to the ability
> of bypassing generation of notification?
>

Last time we discussed this the conclusion was an API of a group-less
default mask, for example:

1. fanotify_mark(FAN_GROUP_DEFAULT,
                           FAN_MARK_ADD | FAN_MARK_MOUNT,
                           FAN_PRE_ACCESS, AT_FDCWD, path);
2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS
3. then HSM is started and subscribes to FAN_PRE_ACCESS
4. and then the mount is moved or bind mounted into a path exported to users

It is a simple solution that should be easy to implement.
But it does not involve "register the HSM app with the filesystem",
unless you mean that a process that opens an HSM group
(FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically
be given FMODE_NONOTIFY files?

There is one more crazy idea that I was pondering -
what if we used the fanotify_fd as mount_fd arg to open_by_handle_at()?
The framing is that it is not the filesystem, but fanotify which actually
encoded the fsid+fid, so HSM could be asking fanotify to decode them.
Technically, the group could keep a unique map from fsid -> sb, then
fanotify group could decode an fanotify_event_info_fid buffer to a specific
inode on a specific fs.
Naturally, those decoded files would be FMODE_NONOTIFY.

Too crazy?

I am very much open to a simpler suggestion.

> > Do you have a preference for 3?
>
> Yeah. FAN_CLOSE_FD seems as a hack to me (and you're likely to get it wrong
> until you've spent quite some time debugging with exec sometimes fails with
> ETXTBUSY). So I prefer FAN_REPORT_EVENT_ID and use event id as an
> identifier when replying to HSM events.

OK.

At the end of this week I am going on a one week vacation,
so it is unlikely that pre-content events will be ready for 6.9.
Perhaps I will be able to get some prep work ready for the
merge window, we shall see.

Thanks,
Amir.
Jan Kara Feb. 15, 2024, 11:51 a.m. UTC | #15
On Mon 12-02-24 16:56:28, Amir Goldstein wrote:
> On Mon, Feb 12, 2024 at 2:01 PM Jan Kara <jack@suse.cz> wrote:
> > On Thu 08-02-24 21:21:13, Amir Goldstein wrote:
> > > On Thu, Feb 8, 2024 at 8:31 PM Jan Kara <jack@suse.cz> wrote:
> > > > On Thu 08-02-24 16:04:29, Amir Goldstein wrote:
> > > > > > > On Mon 29-01-24 20:30:34, Amir Goldstein wrote:
> > > > > > > > On Mon, Dec 18, 2023 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > > > In the HttpDirFS HSM demo, I used FAN_OPEN_PERM on a mount mark
> > > > > > > > > to deny open of file during the short time that it's content is being
> > > > > > > > > punched out.
> > > > > > > > > It is quite complicated to explain, but I only used it for denying access,
> > > > > > > > > not to fill content and not to write anything to filesystem.
> > > > > > > > > It's worth noting that returning EBUSY in that case would be more meaningful
> > > > > > > > > to users.
> > > > > > > > >
> > > > > > > > > That's one case in favor of allowing FAN_DENY_ERRNO for FAN_OPEN_PERM,
> > > > > > > > > but mainly I do not have a proof that people will not need it.
> > > > > > > > >
> > > > > > > > > OTOH, I am a bit concerned that this will encourage developer to use
> > > > > > > > > FAN_OPEN_PERM as a trigger to filling file content and then we are back to
> > > > > > > > > deadlock risk zone.
> > > > > > > > >
> > > > > > > > > Not sure which way to go.
> > > > > > > > >
> > > > > > > > > Anyway, I think we agree that there is no reason to merge FAN_DENY_ERRNO
> > > > > > > > > before FAN_PRE_* events, so we can continue this discussion later when
> > > > > > > > > I post FAN_PRE_* patches - not for this cycle.
> > > > > > > >
> > > > > > > > I started to prepare the pre-content events patches for posting and got back
> > > > > > > > to this one as well.
> > > > > > > >
> > > > > > > > Since we had this discussion I have learned of another use case that
> > > > > > > > requires filling file content in FAN_OPEN_PERM hook, FAN_OPEN_EXEC_PERM
> > > > > > > > to be exact.
> > > > > > > >
> > > > > > > > The reason is that unless an executable content is filled at execve() time,
> > > > > > > > there is no other opportunity to fill its content without getting -ETXTBSY.
> > > > > > >
> > > > > > > Yes, I've been scratching my head over this usecase for a few days. I was
> > > > > > > thinking whether we could somehow fill in executable (and executed) files on
> > > > > > > access but it all seemed too hacky so I agree that we probably have to fill
> > > > > > > them in on open.
> > > > > > >
> > > > > >
> > > > > > Normally, I think there will not be a really huge executable(?)
> > > > > > If there were huge executables, they would have likely been broken down
> > > > > > into smaller loadable libraries which should allow more granular
> > > > > > content filling,
> > > > > > but I guess there will always be worst case exceptions.
> > > > > >
> > > > > > > > So to keep things more flexible, I decided to add -ETXTBSY to the
> > > > > > > > allowed errors with FAN_DENY_ERRNO() and to decided to allow
> > > > > > > > FAN_DENY_ERRNO() with all permission events.
> > > > > > > >
> > > > > > > > To keep FAN_DENY_ERRNO() a bit more focused on HSM, I have
> > > > > > > > added a limitation that FAN_DENY_ERRNO() is allowed only for
> > > > > > > > FAN_CLASS_PRE_CONTENT groups.
> > > > > > >
> > > > > > > I have no problem with adding -ETXTBSY to the set of allowed errors. That
> > > > > > > makes sense. Adding FAN_DENY_ERRNO() to all permission events in
> > > > > > > FAN_CLASS_PRE_CONTENT groups - OK,
> > > > > >
> > > > > > done that.
> > > > > >
> > > > > > I am still not very happy about FAN_OPEN_PERM being part of HSM
> > > > > > event family when I know that O_TRUCT and O_CREAT call this hook
> > > > > > with sb writers held.
> > > > > >
> > > > > > The irony, is that there is no chance that O_TRUNC will require filling
> > > > > > content, same if the file is actually being created by O_CREAT, so the
> > > > > > cases where sb writers is actually needed and the case where content
> > > > > > filling is needed do not overlap, but I cannot figure out how to get those
> > > > > > cases out of the HSM risk zone. Ideas?
> > > > > >
> > > > >
> > > > > Jan,
> > > > >
> > > > > I wanted to run an idea by you.
> > > > >
> > > > > I like your idea to start a clean slate with
> > > > > FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID
> > > > > and it would be nice if we could restrict this HSM to use
> > > > > pre-content events, which is why I was not happy about allowing
> > > > > FAN_DENY_ERRNO() for the legacy FAN_OPEN*_PERM events,
> > > > > especially with the known deadlocks.
> > > > >
> > > > > Since we already know that we need to generate
> > > > > FAN_PRE_ACCESS(offset,length) for read-only mmap() and
> > > > > FAN_PRE_MODIFY(offset,length) for writable mmap(),
> > > > > we could treat uselib() and execve() the same way and generate
> > > > > FAN_PRE_ACCESS(0,i_size) as if the file was mmaped.
> > > >
> > > > BTW uselib() is deprecated and there is a patch queued to not generate
> > > > OPEN_EXEC events for it because it was causing problems (not the generation
> > > > of events itself but the FMODE_EXEC bit being set in uselib). So I don't
> > > > think we need to instrument uselib().
> > > >
> > >
> > > Great. The fewer the better :)
> > >
> > > BTW, for mmap, I was thinking of adding fsnotify_file_perm() next to
> > > call sites of security_mmap_file(), but I see that:
> > > 1. shmat() has security_mmap_file() - is it relevant?
> >
> > Well, this is effectively mmap(2) of a tmpfs file. So I don't think this is
> > particularly useful for HSM purposes but we should probably have it for
> > consistency?
> 
> OK.
> 
> Actually, I think there is a use case for tmpfs + HSM. HSM is about
> lazy populating of a local filesystem from a remote slower tier.
> The fast local tier could be persistent, but could just as well be tmpfs,
> for a temporary lazy local mirror.
> 
> This use case reminds overlayfs a bit, where the local tier is the upper fs,
> but in case of HSM, the "copy up" happens also on read.
> 
> I do want to limit HSM to "fully featured local fs", so we do not end up
> with crazy fs doing HSM. The very minimum should be decodable
> file handle support and unique fsid, but we could also think of more strict
> requirements(?). Anyway, tmpfs should meet all the "fully featured local fs"
> requirement.

Ok, makes sense.

> > > > > I've already pushed a POC to fan_pre_content branch [1].
> > > > > Only sanity tested that nothing else is broken.
> > > > > I still need to add the mmap hooks and test the new events.
> > > > >
> > > > > With this, HSM will have appropriate hooks to fill executable
> > > > > and library on first access and also fill arbitrary files on open
> > > > > including the knowledge if the file was opened for write.
> > > > >
> > > > > Thoughts?
> > > >
> > > > Yeah, I guess this looks sensible.
> > >
> > > Cool, so let's see, what is left to do for the plan of
> > > FAN_CLASS_PRE_CONTENT | FAN_REPORT_FID?
> > >
> > > 1. event->fd is O_PATH mount_fd for open_by_handle_at()
> > > 2. open_by_handle_at() inherits FMODE_NONOTIFY from mount_fd
> > > 3. either implement the FAN_CLOSE_FD response flag (easy?) and/or
> > >     implement FAN_REPORT_EVENT_ID and new header format
> > >
> > > Anything else?
> > > Are you ok with 1 and 2?
> >
> > I'm not sure about 1) and 2) so I'm mostly thinking out loud now. AFAIU you
> > want to provide mount_fd only because of FMODE_NONOTIFY inheritance so 2)
> > is the key question. But if you provide mount_fd with FMODE_NONOTIFY and
> > have FMODE_NONOTIFY inheritance, then what's the difference to just allow
> > opens with FMODE_NONOTIFY from the start? I don't think restricting
> > FMODE_NONOTIFY to inheritance gives any additional strong security
> > guarantees?
> >
> 
> You are right.
> Just need to think of the right API.
> It is just very convenient to be getting the FMODE_NONOTIFY from
> fanotify, but let's think about it.
> 
> > I understand so far we didn't expose FMODE_NONOTIFY so that people cannot
> > bypass fanotify permission events. But now we need a sensible way to fill
> > in the filesystem without deadlocking on our own watches. Maybe exposing
> > FMODE_NONOTIFY as an open flag is too attractive for misuse so could be
> > somehow tie it to the HSM app? We were already discussing that we need to
> > somehow register the HSM app with the filesystem to avoid issues when the
> > app crashes or so. So maybe we could tie this registration to the ability
> > of bypassing generation of notification?
> >
> 
> Last time we discussed this the conclusion was an API of a group-less
> default mask, for example:
> 
> 1. fanotify_mark(FAN_GROUP_DEFAULT,
>                            FAN_MARK_ADD | FAN_MARK_MOUNT,
>                            FAN_PRE_ACCESS, AT_FDCWD, path);
> 2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS
> 3. then HSM is started and subscribes to FAN_PRE_ACCESS
> 4. and then the mount is moved or bind mounted into a path exported to users

Yes, this was the process I was talking about.
 
> It is a simple solution that should be easy to implement.
> But it does not involve "register the HSM app with the filesystem",
> unless you mean that a process that opens an HSM group
> (FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically
> be given FMODE_NONOTIFY files?

Two ideas: What you describe above seems like what the new mount API was
intended for? What if we introduced something like an "hsm" mount option
which would basically enable calling into pre-content event handlers
(for sb without this flag handlers wouldn't be called and you cannot place
pre-content marks on such sb). These handlers would return EACCESS unless
there's somebody handling events and returning something else.

You could then do:

fan_fd = fanotify_init()
ffd = fsopen()
fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0)
fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0)
rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0)
fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL)
<now you can move the superblock into the mount hierarchy>

This would elegantly solve the "what if HSM handler dies" problem as well
as cleanly handle the setup. And we don't have to come up with a concept of
"default mask". Now we still have the problem how to fill in the filesystem
on pre-content event without deadlocking. As I was thinking about it the
most elegant solution would IMHO be if the HSM handler could have a private
mount flagged so that HSM is excluded from there (or it could place ignore
mark on this mount for HSM events). I think we've discarded similar ideas
in the past because this is problematic with directory pre-content events
because security hooks don't get the mountpoint. But what if we used
security_path_* hooks for directory pre-content events?

> There is one more crazy idea that I was pondering -
> what if we used the fanotify_fd as mount_fd arg to open_by_handle_at()?
> The framing is that it is not the filesystem, but fanotify which actually
> encoded the fsid+fid, so HSM could be asking fanotify to decode them.
> Technically, the group could keep a unique map from fsid -> sb, then
> fanotify group could decode an fanotify_event_info_fid buffer to a specific
> inode on a specific fs.
> Naturally, those decoded files would be FMODE_NONOTIFY.
> 
> Too crazy?

It sounds a bit complex and hooking into open_by_handle_at() code for this
sounds a bit hacky. But I'm not completely rejecting this possibility if we
don't find other options.

								Honza
Amir Goldstein Feb. 15, 2024, 3:40 p.m. UTC | #16
> > Last time we discussed this the conclusion was an API of a group-less
> > default mask, for example:
> >
> > 1. fanotify_mark(FAN_GROUP_DEFAULT,
> >                            FAN_MARK_ADD | FAN_MARK_MOUNT,
> >                            FAN_PRE_ACCESS, AT_FDCWD, path);
> > 2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS
> > 3. then HSM is started and subscribes to FAN_PRE_ACCESS
> > 4. and then the mount is moved or bind mounted into a path exported to users
>
> Yes, this was the process I was talking about.
>
> > It is a simple solution that should be easy to implement.
> > But it does not involve "register the HSM app with the filesystem",
> > unless you mean that a process that opens an HSM group
> > (FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically
> > be given FMODE_NONOTIFY files?
>
> Two ideas: What you describe above seems like what the new mount API was
> intended for? What if we introduced something like an "hsm" mount option
> which would basically enable calling into pre-content event handlers

I like that.
I forgot that with my suggestion we'd need a path to setup
the default mask.

> (for sb without this flag handlers wouldn't be called and you cannot place
> pre-content marks on such sb).

IMO, that limitation (i.e. inside brackets) is too restrictive.
In many cases, the user running HSM may not have control over the
mount of the filesystem (inside containers?).
It is true that HSM without anti-crash protection is less reliable,
but I think that it is still useful enough that users will want the
option to run it (?).

Think of my HttpDirFS demo - it's just a simple lazy mirroring
of a website. Even with low reliability I think it is useful (?).

> These handlers would return EACCESS unless
> there's somebody handling events and returning something else.
>
> You could then do:
>
> fan_fd = fanotify_init()
> ffd = fsopen()
> fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0)
> fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0)
> rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0)
> fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL)
> <now you can move the superblock into the mount hierarchy>
>

Not too bad.
I think that "hsm_deny_mask=" mount options would give more flexibility,
but I could be convinced otherwise.

It's probably not a great idea to be running two different HSMs on the same
fs anyway, but if user has an old HSM version installed that handles only
pre-content events, I don't think that we want this old version if it happens
to be run by mistake, to allow for unsupervised create,rename,delete if the
admin wanted to atomically mount a fs that SHOULD be supervised by a
v2 HSM that knows how to handle pre-path events.

IOW, and "HSM bit" on sb is too broad IMO.

> This would elegantly solve the "what if HSM handler dies" problem as well
> as cleanly handle the setup. And we don't have to come up with a concept of
> "default mask".

We can still have a mask, it's just about the API to set it up.

> Now we still have the problem how to fill in the filesystem
> on pre-content event without deadlocking. As I was thinking about it the
> most elegant solution would IMHO be if the HSM handler could have a private
> mount flagged so that HSM is excluded from there (or it could place ignore
> mark on this mount for HSM events).

My HttpDirFS demo does it the other way around - HSM uses a mount
without a mark mount - but ignore mark works too.

> I think we've discarded similar ideas
> in the past because this is problematic with directory pre-content events
> because security hooks don't get the mountpoint. But what if we used
> security_path_* hooks for directory pre-content events?
>

No need for security_path_ * hooks.
In my POC, the pre-path hooks have the path argument.
For people who are not familiar with the term, here is man page draft
for "pre-path" events:
https://github.com/amir73il/man-pages/commits/fan_pre_path/

This is an out of date branch from the time that I called them
FAN_PRE_{CREATE,DELETE,MOVE_*} events:
https://github.com/amir73il/linux/commit/29c60e4db3068ff2cd7b2c5a73108afb2c19b868

They are implemented by replacing the mnt_want_write() calls
with mnt_want_write_{path,parent,parents}() calls.

This was done to make sure that they take the sb write srcu and call
the pre-path hook before taking sb writers freeze protection.

> > There is one more crazy idea that I was pondering -
> > what if we used the fanotify_fd as mount_fd arg to open_by_handle_at()?
> > The framing is that it is not the filesystem, but fanotify which actually
> > encoded the fsid+fid, so HSM could be asking fanotify to decode them.
> > Technically, the group could keep a unique map from fsid -> sb, then
> > fanotify group could decode an fanotify_event_info_fid buffer to a specific
> > inode on a specific fs.
> > Naturally, those decoded files would be FMODE_NONOTIFY.
> >
> > Too crazy?
>
> It sounds a bit complex and hooking into open_by_handle_at() code for this
> sounds a bit hacky. But I'm not completely rejecting this possibility if we
> don't find other options.

Your idea sounds better :)

Thanks,
Amir.
Jan Kara Feb. 19, 2024, 11:01 a.m. UTC | #17
On Thu 15-02-24 17:40:07, Amir Goldstein wrote:
> > > Last time we discussed this the conclusion was an API of a group-less
> > > default mask, for example:
> > >
> > > 1. fanotify_mark(FAN_GROUP_DEFAULT,
> > >                            FAN_MARK_ADD | FAN_MARK_MOUNT,
> > >                            FAN_PRE_ACCESS, AT_FDCWD, path);
> > > 2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS
> > > 3. then HSM is started and subscribes to FAN_PRE_ACCESS
> > > 4. and then the mount is moved or bind mounted into a path exported to users
> >
> > Yes, this was the process I was talking about.
> >
> > > It is a simple solution that should be easy to implement.
> > > But it does not involve "register the HSM app with the filesystem",
> > > unless you mean that a process that opens an HSM group
> > > (FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically
> > > be given FMODE_NONOTIFY files?
> >
> > Two ideas: What you describe above seems like what the new mount API was
> > intended for? What if we introduced something like an "hsm" mount option
> > which would basically enable calling into pre-content event handlers
> 
> I like that.
> I forgot that with my suggestion we'd need a path to setup
> the default mask.
> 
> > (for sb without this flag handlers wouldn't be called and you cannot place
> > pre-content marks on such sb).
> 
> IMO, that limitation (i.e. inside brackets) is too restrictive.
> In many cases, the user running HSM may not have control over the
> mount of the filesystem (inside containers?).
> It is true that HSM without anti-crash protection is less reliable,
> but I think that it is still useful enough that users will want the
> option to run it (?).
> 
> Think of my HttpDirFS demo - it's just a simple lazy mirroring
> of a website. Even with low reliability I think it is useful (?).

Yeah, ok, makes sense. But for such "unpriviledged" usecases we don't have
a deadlock-free way to fill in the file contents because that requires a
special mountpoint?

> > These handlers would return EACCESS unless
> > there's somebody handling events and returning something else.
> >
> > You could then do:
> >
> > fan_fd = fanotify_init()
> > ffd = fsopen()
> > fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0)
> > fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0)
> > rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0)
> > fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL)
> > <now you can move the superblock into the mount hierarchy>
> 
> Not too bad.
> I think that "hsm_deny_mask=" mount options would give more flexibility,
> but I could be convinced otherwise.
> 
> It's probably not a great idea to be running two different HSMs on the same
> fs anyway, but if user has an old HSM version installed that handles only
> pre-content events, I don't think that we want this old version if it happens
> to be run by mistake, to allow for unsupervised create,rename,delete if the
> admin wanted to atomically mount a fs that SHOULD be supervised by a
> v2 HSM that knows how to handle pre-path events.
> 
> IOW, and "HSM bit" on sb is too broad IMO.

OK. So "hsm_deny_mask=" would esentially express events that we require HSM
to handle, the rest would just be accepted by default. That makes sense.
The only thing I kind of dislike is that this ties fanotify API with mount
API. So perhaps hsm_deny_mask should be specified as a string? Like
preaccess,premodify,prelookup,... and transformed into a bitmask only
inside the kernel? It gives us more maneuvering space for the future.

> > This would elegantly solve the "what if HSM handler dies" problem as well
> > as cleanly handle the setup. And we don't have to come up with a concept of
> > "default mask".
> 
> We can still have a mask, it's just about the API to set it up.
> 
> > Now we still have the problem how to fill in the filesystem
> > on pre-content event without deadlocking. As I was thinking about it the
> > most elegant solution would IMHO be if the HSM handler could have a private
> > mount flagged so that HSM is excluded from there (or it could place ignore
> > mark on this mount for HSM events).
> 
> My HttpDirFS demo does it the other way around - HSM uses a mount
> without a mark mount - but ignore mark works too.

Yes, the HSM handler is free to setup whatever works for it. I was just
thinking to make sure there is at least one sane way how to do it :)

> > I think we've discarded similar ideas
> > in the past because this is problematic with directory pre-content events
> > because security hooks don't get the mountpoint. But what if we used
> > security_path_* hooks for directory pre-content events?
> 
> No need for security_path_ * hooks.
> In my POC, the pre-path hooks have the path argument.
> For people who are not familiar with the term, here is man page draft
> for "pre-path" events:
> https://github.com/amir73il/man-pages/commits/fan_pre_path/
> 
> This is an out of date branch from the time that I called them
> FAN_PRE_{CREATE,DELETE,MOVE_*} events:
> https://github.com/amir73il/linux/commit/29c60e4db3068ff2cd7b2c5a73108afb2c19b868
> 
> They are implemented by replacing the mnt_want_write() calls
> with mnt_want_write_{path,parent,parents}() calls.
> 
> This was done to make sure that they take the sb write srcu and call
> the pre-path hook before taking sb writers freeze protection.

Ok, so AFAIU you agree we don't need to rely on FMODE_NONOTIFY for HSM and
can just use access through dedicated mount for filling in the filesystem?

								Honza
Amir Goldstein Feb. 27, 2024, 7:42 p.m. UTC | #18
On Mon, Feb 19, 2024 at 1:01 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 15-02-24 17:40:07, Amir Goldstein wrote:
> > > > Last time we discussed this the conclusion was an API of a group-less
> > > > default mask, for example:
> > > >
> > > > 1. fanotify_mark(FAN_GROUP_DEFAULT,
> > > >                            FAN_MARK_ADD | FAN_MARK_MOUNT,
> > > >                            FAN_PRE_ACCESS, AT_FDCWD, path);
> > > > 2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS
> > > > 3. then HSM is started and subscribes to FAN_PRE_ACCESS
> > > > 4. and then the mount is moved or bind mounted into a path exported to users
> > >
> > > Yes, this was the process I was talking about.
> > >
> > > > It is a simple solution that should be easy to implement.
> > > > But it does not involve "register the HSM app with the filesystem",
> > > > unless you mean that a process that opens an HSM group
> > > > (FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically
> > > > be given FMODE_NONOTIFY files?
> > >
> > > Two ideas: What you describe above seems like what the new mount API was
> > > intended for? What if we introduced something like an "hsm" mount option
> > > which would basically enable calling into pre-content event handlers
> >
> > I like that.
> > I forgot that with my suggestion we'd need a path to setup
> > the default mask.
> >
> > > (for sb without this flag handlers wouldn't be called and you cannot place
> > > pre-content marks on such sb).
> >
> > IMO, that limitation (i.e. inside brackets) is too restrictive.
> > In many cases, the user running HSM may not have control over the
> > mount of the filesystem (inside containers?).
> > It is true that HSM without anti-crash protection is less reliable,
> > but I think that it is still useful enough that users will want the
> > option to run it (?).
> >
> > Think of my HttpDirFS demo - it's just a simple lazy mirroring
> > of a website. Even with low reliability I think it is useful (?).
>
> Yeah, ok, makes sense. But for such "unpriviledged" usecases we don't have
> a deadlock-free way to fill in the file contents because that requires a
> special mountpoint?
>

True, unless we also keep the FMODE_NONOTIFY event->fd
for the simple cases. I'll need to think about this some more.

> > > These handlers would return EACCESS unless
> > > there's somebody handling events and returning something else.
> > >
> > > You could then do:
> > >
> > > fan_fd = fanotify_init()
> > > ffd = fsopen()
> > > fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0)
> > > fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0)
> > > rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0)
> > > fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL)
> > > <now you can move the superblock into the mount hierarchy>
> >
> > Not too bad.
> > I think that "hsm_deny_mask=" mount options would give more flexibility,
> > but I could be convinced otherwise.
> >
> > It's probably not a great idea to be running two different HSMs on the same
> > fs anyway, but if user has an old HSM version installed that handles only
> > pre-content events, I don't think that we want this old version if it happens
> > to be run by mistake, to allow for unsupervised create,rename,delete if the
> > admin wanted to atomically mount a fs that SHOULD be supervised by a
> > v2 HSM that knows how to handle pre-path events.
> >
> > IOW, and "HSM bit" on sb is too broad IMO.
>
> OK. So "hsm_deny_mask=" would esentially express events that we require HSM
> to handle, the rest would just be accepted by default. That makes sense.

Yes.

> The only thing I kind of dislike is that this ties fanotify API with mount
> API. So perhaps hsm_deny_mask should be specified as a string? Like
> preaccess,premodify,prelookup,... and transformed into a bitmask only
> inside the kernel? It gives us more maneuvering space for the future.
>

Urgh. I see what you are saying, but this seems so ugly to me.
I have a strong feeling that we are trying to reinvent something
and that we are going to reinvent it badly.
I need to look for precedents, maybe in other OS.
I believe that in Windows, there is an API to register as a
Cloud Engine Provider, so there is probably a way to have multiple HSMs
working on different sections of the filesystem in some reliable
crash safe manner.

> > > This would elegantly solve the "what if HSM handler dies" problem as well
> > > as cleanly handle the setup. And we don't have to come up with a concept of
> > > "default mask".
> >
> > We can still have a mask, it's just about the API to set it up.
> >
> > > Now we still have the problem how to fill in the filesystem
> > > on pre-content event without deadlocking. As I was thinking about it the
> > > most elegant solution would IMHO be if the HSM handler could have a private
> > > mount flagged so that HSM is excluded from there (or it could place ignore
> > > mark on this mount for HSM events).
> >
> > My HttpDirFS demo does it the other way around - HSM uses a mount
> > without a mark mount - but ignore mark works too.
>
> Yes, the HSM handler is free to setup whatever works for it. I was just
> thinking to make sure there is at least one sane way how to do it :)
>

Yeh, we need to write "best practice" guidelines.

> > > I think we've discarded similar ideas
> > > in the past because this is problematic with directory pre-content events
> > > because security hooks don't get the mountpoint. But what if we used
> > > security_path_* hooks for directory pre-content events?
> >
> > No need for security_path_ * hooks.
> > In my POC, the pre-path hooks have the path argument.
> > For people who are not familiar with the term, here is man page draft
> > for "pre-path" events:
> > https://github.com/amir73il/man-pages/commits/fan_pre_path/
> >
> > This is an out of date branch from the time that I called them
> > FAN_PRE_{CREATE,DELETE,MOVE_*} events:
> > https://github.com/amir73il/linux/commit/29c60e4db3068ff2cd7b2c5a73108afb2c19b868
> >
> > They are implemented by replacing the mnt_want_write() calls
> > with mnt_want_write_{path,parent,parents}() calls.
> >
> > This was done to make sure that they take the sb write srcu and call
> > the pre-path hook before taking sb writers freeze protection.
>
> Ok, so AFAIU you agree we don't need to rely on FMODE_NONOTIFY for HSM and
> can just use access through dedicated mount for filling in the filesystem?
>

It seems like a decent and simple way to avoid difficult questions,
so I will try to start with that...
whenever I manage to context switch back ;)

Thanks,
Amir.
Jan Kara March 4, 2024, 10:33 a.m. UTC | #19
On Tue 27-02-24 21:42:37, Amir Goldstein wrote:
> On Mon, Feb 19, 2024 at 1:01 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 15-02-24 17:40:07, Amir Goldstein wrote:
> > > > > Last time we discussed this the conclusion was an API of a group-less
> > > > > default mask, for example:
> > > > >
> > > > > 1. fanotify_mark(FAN_GROUP_DEFAULT,
> > > > >                            FAN_MARK_ADD | FAN_MARK_MOUNT,
> > > > >                            FAN_PRE_ACCESS, AT_FDCWD, path);
> > > > > 2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS
> > > > > 3. then HSM is started and subscribes to FAN_PRE_ACCESS
> > > > > 4. and then the mount is moved or bind mounted into a path exported to users
> > > >
> > > > Yes, this was the process I was talking about.
> > > >
> > > > > It is a simple solution that should be easy to implement.
> > > > > But it does not involve "register the HSM app with the filesystem",
> > > > > unless you mean that a process that opens an HSM group
> > > > > (FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically
> > > > > be given FMODE_NONOTIFY files?
> > > >
> > > > Two ideas: What you describe above seems like what the new mount API was
> > > > intended for? What if we introduced something like an "hsm" mount option
> > > > which would basically enable calling into pre-content event handlers
> > >
> > > I like that.
> > > I forgot that with my suggestion we'd need a path to setup
> > > the default mask.
> > >
> > > > (for sb without this flag handlers wouldn't be called and you cannot place
> > > > pre-content marks on such sb).
> > >
> > > IMO, that limitation (i.e. inside brackets) is too restrictive.
> > > In many cases, the user running HSM may not have control over the
> > > mount of the filesystem (inside containers?).
> > > It is true that HSM without anti-crash protection is less reliable,
> > > but I think that it is still useful enough that users will want the
> > > option to run it (?).
> > >
> > > Think of my HttpDirFS demo - it's just a simple lazy mirroring
> > > of a website. Even with low reliability I think it is useful (?).
> >
> > Yeah, ok, makes sense. But for such "unpriviledged" usecases we don't have
> > a deadlock-free way to fill in the file contents because that requires a
> > special mountpoint?
> 
> True, unless we also keep the FMODE_NONOTIFY event->fd
> for the simple cases. I'll need to think about this some more.

Well, but even creating new fds with FMODE_NONOTIFY or setting up fanotify
group with HSM events need to be somehow priviledged operation (currently
it requires CAP_SYS_ADMIN). So the more I think about it the less obvious
the "unpriviledged" usecase seems to be.

> > > > These handlers would return EACCESS unless
> > > > there's somebody handling events and returning something else.
> > > >
> > > > You could then do:
> > > >
> > > > fan_fd = fanotify_init()
> > > > ffd = fsopen()
> > > > fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0)
> > > > fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0)
> > > > rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0)
> > > > fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL)
> > > > <now you can move the superblock into the mount hierarchy>
> > >
> > > Not too bad.
> > > I think that "hsm_deny_mask=" mount options would give more flexibility,
> > > but I could be convinced otherwise.
> > >
> > > It's probably not a great idea to be running two different HSMs on the same
> > > fs anyway, but if user has an old HSM version installed that handles only
> > > pre-content events, I don't think that we want this old version if it happens
> > > to be run by mistake, to allow for unsupervised create,rename,delete if the
> > > admin wanted to atomically mount a fs that SHOULD be supervised by a
> > > v2 HSM that knows how to handle pre-path events.
> > >
> > > IOW, and "HSM bit" on sb is too broad IMO.
> >
> > OK. So "hsm_deny_mask=" would esentially express events that we require HSM
> > to handle, the rest would just be accepted by default. That makes sense.
> 
> Yes.
> 
> > The only thing I kind of dislike is that this ties fanotify API with mount
> > API. So perhaps hsm_deny_mask should be specified as a string? Like
> > preaccess,premodify,prelookup,... and transformed into a bitmask only
> > inside the kernel? It gives us more maneuvering space for the future.
> >
> 
> Urgh. I see what you are saying, but this seems so ugly to me.
> I have a strong feeling that we are trying to reinvent something
> and that we are going to reinvent it badly.
> I need to look for precedents, maybe in other OS.
> I believe that in Windows, there is an API to register as a
> Cloud Engine Provider, so there is probably a way to have multiple HSMs
> working on different sections of the filesystem in some reliable
> crash safe manner.

OK, let's see what other's have came up with :)

									Honza
Christian Brauner March 4, 2024, 12:06 p.m. UTC | #20
On Mon, Mar 04, 2024 at 11:33:37AM +0100, Jan Kara wrote:
> On Tue 27-02-24 21:42:37, Amir Goldstein wrote:
> > On Mon, Feb 19, 2024 at 1:01 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Thu 15-02-24 17:40:07, Amir Goldstein wrote:
> > > > > > Last time we discussed this the conclusion was an API of a group-less
> > > > > > default mask, for example:
> > > > > >
> > > > > > 1. fanotify_mark(FAN_GROUP_DEFAULT,
> > > > > >                            FAN_MARK_ADD | FAN_MARK_MOUNT,
> > > > > >                            FAN_PRE_ACCESS, AT_FDCWD, path);
> > > > > > 2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS
> > > > > > 3. then HSM is started and subscribes to FAN_PRE_ACCESS
> > > > > > 4. and then the mount is moved or bind mounted into a path exported to users
> > > > >
> > > > > Yes, this was the process I was talking about.
> > > > >
> > > > > > It is a simple solution that should be easy to implement.
> > > > > > But it does not involve "register the HSM app with the filesystem",
> > > > > > unless you mean that a process that opens an HSM group
> > > > > > (FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically
> > > > > > be given FMODE_NONOTIFY files?
> > > > >
> > > > > Two ideas: What you describe above seems like what the new mount API was
> > > > > intended for? What if we introduced something like an "hsm" mount option
> > > > > which would basically enable calling into pre-content event handlers
> > > >
> > > > I like that.
> > > > I forgot that with my suggestion we'd need a path to setup
> > > > the default mask.
> > > >
> > > > > (for sb without this flag handlers wouldn't be called and you cannot place
> > > > > pre-content marks on such sb).
> > > >
> > > > IMO, that limitation (i.e. inside brackets) is too restrictive.
> > > > In many cases, the user running HSM may not have control over the
> > > > mount of the filesystem (inside containers?).
> > > > It is true that HSM without anti-crash protection is less reliable,
> > > > but I think that it is still useful enough that users will want the
> > > > option to run it (?).
> > > >
> > > > Think of my HttpDirFS demo - it's just a simple lazy mirroring
> > > > of a website. Even with low reliability I think it is useful (?).
> > >
> > > Yeah, ok, makes sense. But for such "unpriviledged" usecases we don't have
> > > a deadlock-free way to fill in the file contents because that requires a
> > > special mountpoint?
> > 
> > True, unless we also keep the FMODE_NONOTIFY event->fd
> > for the simple cases. I'll need to think about this some more.
> 
> Well, but even creating new fds with FMODE_NONOTIFY or setting up fanotify
> group with HSM events need to be somehow priviledged operation (currently
> it requires CAP_SYS_ADMIN). So the more I think about it the less obvious
> the "unpriviledged" usecase seems to be.
> 
> > > > > These handlers would return EACCESS unless
> > > > > there's somebody handling events and returning something else.
> > > > >
> > > > > You could then do:
> > > > >
> > > > > fan_fd = fanotify_init()
> > > > > ffd = fsopen()
> > > > > fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0)
> > > > > fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0)
> > > > > rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0)

Do you absolutely need that superblock to exist? It would probably be
better if you extended struct fs_context to carry the hsm information
and then you later attach it to the superblock once it's created.

> > > > > fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL)
> > > > > <now you can move the superblock into the mount hierarchy>
> > > >
> > > > Not too bad.
> > > > I think that "hsm_deny_mask=" mount options would give more flexibility,
> > > > but I could be convinced otherwise.
> > > >
> > > > It's probably not a great idea to be running two different HSMs on the same
> > > > fs anyway, but if user has an old HSM version installed that handles only
> > > > pre-content events, I don't think that we want this old version if it happens
> > > > to be run by mistake, to allow for unsupervised create,rename,delete if the
> > > > admin wanted to atomically mount a fs that SHOULD be supervised by a
> > > > v2 HSM that knows how to handle pre-path events.
> > > >
> > > > IOW, and "HSM bit" on sb is too broad IMO.
> > >
> > > OK. So "hsm_deny_mask=" would esentially express events that we require HSM
> > > to handle, the rest would just be accepted by default. That makes sense.
> > 
> > Yes.
> > 
> > > The only thing I kind of dislike is that this ties fanotify API with mount
> > > API. So perhaps hsm_deny_mask should be specified as a string? Like

Yeah. I don't consider that in itself a dealbreaker though. The messier
part is that the proposal above requires that the superblock exists
before you can set up fanotify. If you register an HSM during mount then
you should set everything as regular mount options before superblock
creation. If the superblock is created the HSM should go into effect
implicitly instead of requiring another fanotify call, imho.

But you could also do it the other way around:

fanotify_something_something(fan_fd, fs_context_fd, ...)
{
        fdget(fs_context_fd);
        if (f.file->f_op != &fscontext_fops)
                return -EINVAL;

        ret = mutex_lock_interruptible(&fc->uapi_mutex);
        if (ret == 0) {
                if (fc->phase != FS_CONTEXT_CREATE_PARAMS &&
                    fc->phase != FS_CONTEXT_RECONF_PARAMS)
                        return -EBUSY;

                /* do your thing */

                mutex_unlock(&fc->uapi_mutex);
        }
}

then you don't need to even use the mount api system calls and can go
through the regular fanotify system calls. You'd just need a sensible
helper that is exposed internally to fanotify to register events during
fscontext creation.

I'm not sure what would be better.

> > > preaccess,premodify,prelookup,... and transformed into a bitmask only
> > > inside the kernel? It gives us more maneuvering space for the future.
> > >
> > 
> > Urgh. I see what you are saying, but this seems so ugly to me.

Go look at what bpffs did for their delegation support... Exactly that iirc. :)

> > I have a strong feeling that we are trying to reinvent something
> > and that we are going to reinvent it badly.
> > I need to look for precedents, maybe in other OS.
> > I believe that in Windows, there is an API to register as a
> > Cloud Engine Provider, so there is probably a way to have multiple HSMs
> > working on different sections of the filesystem in some reliable
> > crash safe manner.
> 
> OK, let's see what other's have came up with :)
Amir Goldstein April 15, 2024, 2:23 p.m. UTC | #21
On Mon, Mar 4, 2024 at 12:33 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 27-02-24 21:42:37, Amir Goldstein wrote:
> > On Mon, Feb 19, 2024 at 1:01 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Thu 15-02-24 17:40:07, Amir Goldstein wrote:
> > > > > > Last time we discussed this the conclusion was an API of a group-less
> > > > > > default mask, for example:
> > > > > >
> > > > > > 1. fanotify_mark(FAN_GROUP_DEFAULT,
> > > > > >                            FAN_MARK_ADD | FAN_MARK_MOUNT,
> > > > > >                            FAN_PRE_ACCESS, AT_FDCWD, path);
> > > > > > 2. this returns -EPERM for access until some group handles FAN_PRE_ACCESS
> > > > > > 3. then HSM is started and subscribes to FAN_PRE_ACCESS
> > > > > > 4. and then the mount is moved or bind mounted into a path exported to users
> > > > >
> > > > > Yes, this was the process I was talking about.
> > > > >
> > > > > > It is a simple solution that should be easy to implement.
> > > > > > But it does not involve "register the HSM app with the filesystem",
> > > > > > unless you mean that a process that opens an HSM group
> > > > > > (FAN_REPORT_FID|FAN_CLASS_PRE_CONTENT) should automatically
> > > > > > be given FMODE_NONOTIFY files?
> > > > >
> > > > > Two ideas: What you describe above seems like what the new mount API was
> > > > > intended for? What if we introduced something like an "hsm" mount option
> > > > > which would basically enable calling into pre-content event handlers
> > > >
> > > > I like that.
> > > > I forgot that with my suggestion we'd need a path to setup
> > > > the default mask.
> > > >
> > > > > (for sb without this flag handlers wouldn't be called and you cannot place
> > > > > pre-content marks on such sb).
> > > >
> > > > IMO, that limitation (i.e. inside brackets) is too restrictive.
> > > > In many cases, the user running HSM may not have control over the
> > > > mount of the filesystem (inside containers?).
> > > > It is true that HSM without anti-crash protection is less reliable,
> > > > but I think that it is still useful enough that users will want the
> > > > option to run it (?).
> > > >
> > > > Think of my HttpDirFS demo - it's just a simple lazy mirroring
> > > > of a website. Even with low reliability I think it is useful (?).
> > >
> > > Yeah, ok, makes sense. But for such "unpriviledged" usecases we don't have
> > > a deadlock-free way to fill in the file contents because that requires a
> > > special mountpoint?
> >
> > True, unless we also keep the FMODE_NONOTIFY event->fd
> > for the simple cases. I'll need to think about this some more.
>
> Well, but even creating new fds with FMODE_NONOTIFY or setting up fanotify
> group with HSM events need to be somehow priviledged operation (currently
> it requires CAP_SYS_ADMIN). So the more I think about it the less obvious
> the "unpriviledged" usecase seems to be.
>

ok. Let's put this one on ice for now.

> > > > > These handlers would return EACCESS unless
> > > > > there's somebody handling events and returning something else.
> > > > >
> > > > > You could then do:
> > > > >
> > > > > fan_fd = fanotify_init()
> > > > > ffd = fsopen()
> > > > > fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0)
> > > > > fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0)
> > > > > rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0)
> > > > > fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL)
> > > > > <now you can move the superblock into the mount hierarchy>
> > > >
> > > > Not too bad.
> > > > I think that "hsm_deny_mask=" mount options would give more flexibility,
> > > > but I could be convinced otherwise.
> > > >
> > > > It's probably not a great idea to be running two different HSMs on the same
> > > > fs anyway, but if user has an old HSM version installed that handles only
> > > > pre-content events, I don't think that we want this old version if it happens
> > > > to be run by mistake, to allow for unsupervised create,rename,delete if the
> > > > admin wanted to atomically mount a fs that SHOULD be supervised by a
> > > > v2 HSM that knows how to handle pre-path events.
> > > >
> > > > IOW, and "HSM bit" on sb is too broad IMO.
> > >
> > > OK. So "hsm_deny_mask=" would esentially express events that we require HSM
> > > to handle, the rest would just be accepted by default. That makes sense.
> >
> > Yes.
> >
> > > The only thing I kind of dislike is that this ties fanotify API with mount
> > > API. So perhaps hsm_deny_mask should be specified as a string? Like
> > > preaccess,premodify,prelookup,... and transformed into a bitmask only
> > > inside the kernel? It gives us more maneuvering space for the future.
> > >
> >
> > Urgh. I see what you are saying, but this seems so ugly to me.
> > I have a strong feeling that we are trying to reinvent something
> > and that we are going to reinvent it badly.
> > I need to look for precedents, maybe in other OS.
> > I believe that in Windows, there is an API to register as a
> > Cloud Engine Provider, so there is probably a way to have multiple HSMs
> > working on different sections of the filesystem in some reliable
> > crash safe manner.
>
> OK, let's see what other's have came up with :)

From my very basic Google research (did not ask Chat GPT yet ;))
I think that MacOS FSEvents do not have blocking permission events at all,
so there is no built-in HSM API.

The Windows Cloud Sync Engine API:
https://learn.microsoft.com/en-us/windows/win32/cfapi/build-a-cloud-file-sync-engine
Does allow registring different "Storage namespace providers".
AFAICT, the persistence of "Place holder" files is based on NTFS
"Reparse points",
which are a long time native concept which allows registering a persistent
hook on a file to be handled by a specific Windows driver.

So for example, a Dropbox place holder file, is a file with "reparse point"
that has some label to direct the read/write calls to the Windows
Cloud Sync Engine
driver and a sub-label to direct the handling of the upcall by the Dropbox
CloudSync Engine service.

I do not want to deal with "persistent fanotify marks" at this time, so maybe
something like:

fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "dropbox", 0)
fsconfig(ffd, FSCONFIG_SET_STRING, "hsmver", "1", 0)

Add support ioctls in fanotify_ioctl():
- FANOTIFY_IOC_HSMID
- FANOTIFY_IOC_HSMVER

And require that a group with matching hsmid and recent hsmver has a live
filesystem mark on the sb.

If this is an acceptable API for a single crash-safe HSM provider, then the
question becomes:
How would we extend this to multiple crash-safe HSM providers in the future?

Or maybe we do not need to support multiple HSM groups per sb?
Maybe in the future a generic service could be implemented to
delegate different HSM modules, e.g.:

fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "cloudsync", 0)
fsconfig(ffd, FSCONFIG_SET_STRING, "hsmver", "1", 0)

And a generic "cloudsync" service could be in charge of
registration of "cloudsync" engines and dispatching the pre-content
event to the appropriate module based on path (i.e. under the dropbox folder).

Once this gets passed NACKs from fs developers I'd like to pull in
some distro people to the discussion and maybe bring this up as
a topic discussion for LSFMM if we feel that there is something to discuss.

Thoughts?

Amir.
Jan Kara April 16, 2024, 3:15 p.m. UTC | #22
On Mon 15-04-24 17:23:40, Amir Goldstein wrote:
> On Mon, Mar 4, 2024 at 12:33 PM Jan Kara <jack@suse.cz> wrote:
> > On Tue 27-02-24 21:42:37, Amir Goldstein wrote:
> > > On Mon, Feb 19, 2024 at 1:01 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > These handlers would return EACCESS unless
> > > > > > there's somebody handling events and returning something else.
> > > > > >
> > > > > > You could then do:
> > > > > >
> > > > > > fan_fd = fanotify_init()
> > > > > > ffd = fsopen()
> > > > > > fsconfig(ffd, FSCONFIG_SET_STRING, "source", device, 0)
> > > > > > fsconfig(ffd, FSCONFIG_SET_FLAG, "hsm", NULL, 0)
> > > > > > rootfd = fsconfig(ffd, FSCONFIG_CMD_CREATE, NULL, NULL, 0)
> > > > > > fanotify_mark(fan_fd, FAN_MARK_ADD, ... , rootfd, NULL)
> > > > > > <now you can move the superblock into the mount hierarchy>
> > > > >
> > > > > Not too bad.
> > > > > I think that "hsm_deny_mask=" mount options would give more flexibility,
> > > > > but I could be convinced otherwise.
> > > > >
> > > > > It's probably not a great idea to be running two different HSMs on the same
> > > > > fs anyway, but if user has an old HSM version installed that handles only
> > > > > pre-content events, I don't think that we want this old version if it happens
> > > > > to be run by mistake, to allow for unsupervised create,rename,delete if the
> > > > > admin wanted to atomically mount a fs that SHOULD be supervised by a
> > > > > v2 HSM that knows how to handle pre-path events.
> > > > >
> > > > > IOW, and "HSM bit" on sb is too broad IMO.
> > > >
> > > > OK. So "hsm_deny_mask=" would esentially express events that we require HSM
> > > > to handle, the rest would just be accepted by default. That makes sense.
> > >
> > > Yes.
> > >
> > > > The only thing I kind of dislike is that this ties fanotify API with mount
> > > > API. So perhaps hsm_deny_mask should be specified as a string? Like
> > > > preaccess,premodify,prelookup,... and transformed into a bitmask only
> > > > inside the kernel? It gives us more maneuvering space for the future.
> > > >
> > >
> > > Urgh. I see what you are saying, but this seems so ugly to me.
> > > I have a strong feeling that we are trying to reinvent something
> > > and that we are going to reinvent it badly.
> > > I need to look for precedents, maybe in other OS.
> > > I believe that in Windows, there is an API to register as a
> > > Cloud Engine Provider, so there is probably a way to have multiple HSMs
> > > working on different sections of the filesystem in some reliable
> > > crash safe manner.
> >
> > OK, let's see what other's have came up with :)
> 
> From my very basic Google research (did not ask Chat GPT yet ;))
> I think that MacOS FSEvents do not have blocking permission events at all,
> so there is no built-in HSM API.
> 
> The Windows Cloud Sync Engine API:
> https://learn.microsoft.com/en-us/windows/win32/cfapi/build-a-cloud-file-sync-engine
> Does allow registring different "Storage namespace providers".
> AFAICT, the persistence of "Place holder" files is based on NTFS
> "Reparse points",
> which are a long time native concept which allows registering a persistent
> hook on a file to be handled by a specific Windows driver.
> 
> So for example, a Dropbox place holder file, is a file with "reparse point"
> that has some label to direct the read/write calls to the Windows
> Cloud Sync Engine
> driver and a sub-label to direct the handling of the upcall by the Dropbox
> CloudSync Engine service.

OK, so AFAIU they implement HSM directly in the filesystem which is somewhat
different situation from what we are trying to do.

> I do not want to deal with "persistent fanotify marks" at this time, so
> maybe something like:
> 
> fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "dropbox", 0)
> fsconfig(ffd, FSCONFIG_SET_STRING, "hsmver", "1", 0)
> 
> Add support ioctls in fanotify_ioctl():
> - FANOTIFY_IOC_HSMID
> - FANOTIFY_IOC_HSMVER

What would these do? Set HSMID & HSMVER for fsnotify_group identified by
'file'? BTW I'm not so convinced about the 'version' thing. I have hard
time to remember an example where the versioning in the API actually ended
up being useful. I also expect tight coupling between userspace mounting
the filesystem and setting up HSM so it is hard to imagine some wrong
version of HSM provider would be "accidentally" started for the
filesystem.

> And require that a group with matching hsmid and recent hsmver has a live
> filesystem mark on the sb.

I'm not quite following here. We'd require matching fsnotify group for
what? For mounting the fs? For not returning EPERM from all pre-op
handlers? Either way that doesn't make sense to me as its unclear how
userspace would be able to place the mark... But there's a way around that
- since the HSM app will have its private non-HSM mount for filling in
contents, it can first create that mount, place filesystem marks though
it and then remount the superblock with hsmid mount option and create the
public mount. But I'm not sure if you meant this or something else...

> If this is an acceptable API for a single crash-safe HSM provider, then the
> question becomes:
> How would we extend this to multiple crash-safe HSM providers in the future?

Something like:

fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "dropbox,cloudsync,httpfs", 0)

means all of them are required to have a filesystem mark?
 
> Or maybe we do not need to support multiple HSM groups per sb?
> Maybe in the future a generic service could be implemented to
> delegate different HSM modules, e.g.:
> 
> fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "cloudsync", 0)
> fsconfig(ffd, FSCONFIG_SET_STRING, "hsmver", "1", 0)
> 
> And a generic "cloudsync" service could be in charge of
> registration of "cloudsync" engines and dispatching the pre-content
> event to the appropriate module based on path (i.e. under the dropbox folder).
> 
> Once this gets passed NACKs from fs developers I'd like to pull in
> some distro people to the discussion and maybe bring this up as
> a topic discussion for LSFMM if we feel that there is something to discuss.

I guess a short talk (lighting talk?) what we are planning to do would be
interesting so that people are aware. At this point I don't think we have
particular problems to discuss that would be interesting for the whole fs
crowd for a full slot...

								Honza
Amir Goldstein April 16, 2024, 3:52 p.m. UTC | #23
> > The Windows Cloud Sync Engine API:
> > https://learn.microsoft.com/en-us/windows/win32/cfapi/build-a-cloud-file-sync-engine
> > Does allow registring different "Storage namespace providers".
> > AFAICT, the persistence of "Place holder" files is based on NTFS
> > "Reparse points",
> > which are a long time native concept which allows registering a persistent
> > hook on a file to be handled by a specific Windows driver.
> >
> > So for example, a Dropbox place holder file, is a file with "reparse point"
> > that has some label to direct the read/write calls to the Windows
> > Cloud Sync Engine
> > driver and a sub-label to direct the handling of the upcall by the Dropbox
> > CloudSync Engine service.
>
> OK, so AFAIU they implement HSM directly in the filesystem which is somewhat
> different situation from what we are trying to do.
>

Technically, I think that Reparse points Win32 driver hooks are a
generic WIN32 API which NTFS implements, but that doesn't matter.
IIUC, it is equivalent to having support for xattr "security.hsm.dropbox"
that fanotify would know how to intercept as a persistent mark to
be handled by "dropbox" hsm group or return EPERM.

> > I do not want to deal with "persistent fanotify marks" at this time, so
> > maybe something like:
> >
> > fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "dropbox", 0)
> > fsconfig(ffd, FSCONFIG_SET_STRING, "hsmver", "1", 0)
> >
> > Add support ioctls in fanotify_ioctl():
> > - FANOTIFY_IOC_HSMID
> > - FANOTIFY_IOC_HSMVER
>
> What would these do? Set HSMID & HSMVER for fsnotify_group identified by
> 'file'? BTW I'm not so convinced about the 'version' thing. I have hard
> time to remember an example where the versioning in the API actually ended
> up being useful. I also expect tight coupling between userspace mounting
> the filesystem and setting up HSM so it is hard to imagine some wrong
> version of HSM provider would be "accidentally" started for the
> filesystem.
>

ok. worse case, can alway switch to hsmid "dropboxv2"

> > And require that a group with matching hsmid and recent hsmver has a live
> > filesystem mark on the sb.
>
> I'm not quite following here. We'd require matching fsnotify group for
> what? For mounting the fs? For not returning EPERM from all pre-op
> handlers? Either way that doesn't make sense to me as its unclear how
> userspace would be able to place the mark... But there's a way around that
> - since the HSM app will have its private non-HSM mount for filling in
> contents, it can first create that mount, place filesystem marks though
> it and then remount the superblock with hsmid mount option and create the
> public mount. But I'm not sure if you meant this or something else...
>

I haven't thought of the mechanics yet just the definition:
- An sb with hsm="XXX" returns EPERM for pre-content events
  unless there is an sb mark from a group that is identified as hsm "XXX"

I don't see a problem with mounting the fs first and only then
setting up the sb mark on the root of the fs (which does not require
a pre-lookup event). When the hsm service is restarted, it is going to
need to re-set the sb mark on the hsm="XXX" sb anyway.

> > If this is an acceptable API for a single crash-safe HSM provider, then the
> > question becomes:
> > How would we extend this to multiple crash-safe HSM providers in the future?
>
> Something like:
>
> fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "dropbox,cloudsync,httpfs", 0)
>
> means all of them are required to have a filesystem mark?
>

Yeh, it's an option.
I have a trauma from comma separated values in overlayfs
mount options, but maybe it's fine.
The main API question would be, regardless of single or multi hsm,
whether hsm="" value should be reconfigurable (probably yes).

> > Or maybe we do not need to support multiple HSM groups per sb?
> > Maybe in the future a generic service could be implemented to
> > delegate different HSM modules, e.g.:
> >
> > fsconfig(ffd, FSCONFIG_SET_STRING, "hsmid", "cloudsync", 0)
> > fsconfig(ffd, FSCONFIG_SET_STRING, "hsmver", "1", 0)
> >
> > And a generic "cloudsync" service could be in charge of
> > registration of "cloudsync" engines and dispatching the pre-content
> > event to the appropriate module based on path (i.e. under the dropbox folder).
> >
> > Once this gets passed NACKs from fs developers I'd like to pull in
> > some distro people to the discussion and maybe bring this up as
> > a topic discussion for LSFMM if we feel that there is something to discuss.
>
> I guess a short talk (lighting talk?) what we are planning to do would be
> interesting so that people are aware. At this point I don't think we have
> particular problems to discuss that would be interesting for the whole fs
> crowd for a full slot...

Agree.

Thanks,
Amir.
diff mbox series

Patch

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 1e4def21811e..87798fdc06e7 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -256,18 +256,21 @@  static int fanotify_get_response(struct fsnotify_group *group,
 	}
 
 	/* userspace responded, convert to something usable */
-	switch (event->response & FANOTIFY_RESPONSE_ACCESS) {
+	switch (FAN_RESPONSE_ACCESS(event->response)) {
 	case FAN_ALLOW:
 		ret = 0;
 		break;
 	case FAN_DENY:
+		/* userspace can provide errno other than EPERM */
+		ret = -(FAN_RESPONSE_ERRNO(event->response) ?: EPERM);
+		break;
 	default:
 		ret = -EPERM;
 	}
 
 	/* Check if the response should be audited */
 	if (event->response & FAN_AUDIT)
-		audit_fanotify(event->response & ~FAN_AUDIT,
+		audit_fanotify(FAN_RESPONSE_ACCESS(event->response),
 			       &event->audit_rule);
 
 	pr_debug("%s: group=%p event=%p about to return ret=%d\n", __func__,
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index f83e7cc5ccf2..6b17d33a06aa 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -336,11 +336,12 @@  static int process_access_response(struct fsnotify_group *group,
 	struct fanotify_perm_event *event;
 	int fd = response_struct->fd;
 	u32 response = response_struct->response;
+	int errno = FAN_RESPONSE_ERRNO(response);
 	int ret = info_len;
 	struct fanotify_response_info_audit_rule friar;
 
-	pr_debug("%s: group=%p fd=%d response=%u buf=%p size=%zu\n", __func__,
-		 group, fd, response, info, info_len);
+	pr_debug("%s: group=%p fd=%d response=%x errno=%d buf=%p size=%zu\n",
+		 __func__, group, fd, response, errno, info, info_len);
 	/*
 	 * make sure the response is valid, if invalid we do nothing and either
 	 * userspace can send a valid response or we will clean it up after the
@@ -349,8 +350,11 @@  static int process_access_response(struct fsnotify_group *group,
 	if (response & ~FANOTIFY_RESPONSE_VALID_MASK)
 		return -EINVAL;
 
-	switch (response & FANOTIFY_RESPONSE_ACCESS) {
+	switch (FAN_RESPONSE_ACCESS(response)) {
 	case FAN_ALLOW:
+		if (errno)
+			return -EINVAL;
+		break;
 	case FAN_DENY:
 		break;
 	default:
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 4f1c4f603118..2004a4d3b1dc 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -125,7 +125,14 @@ 
 /* These masks check for invalid bits in permission responses. */
 #define FANOTIFY_RESPONSE_ACCESS (FAN_ALLOW | FAN_DENY)
 #define FANOTIFY_RESPONSE_FLAGS (FAN_AUDIT | FAN_INFO)
-#define FANOTIFY_RESPONSE_VALID_MASK (FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS)
+#define FANOTIFY_RESPONSE_ERRNO	(FAN_ERRNO_MASK << FAN_ERRNO_SHIFT)
+#define FANOTIFY_RESPONSE_VALID_MASK \
+	(FANOTIFY_RESPONSE_ACCESS | FANOTIFY_RESPONSE_FLAGS | \
+	 FANOTIFY_RESPONSE_ERRNO)
+
+/* errno other than EPERM can specified in upper byte of deny response */
+#define FAN_RESPONSE_ACCESS(res)	((res) & FANOTIFY_RESPONSE_ACCESS)
+#define FAN_RESPONSE_ERRNO(res)		((int)((res) >> FAN_ERRNO_SHIFT))
 
 /* Do not use these old uapi constants internally */
 #undef FAN_ALL_CLASS_BITS
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index cd14c94e9a1e..a8bb25c99dd1 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -223,6 +223,13 @@  struct fanotify_response_info_audit_rule {
 /* Legit userspace responses to a _PERM event */
 #define FAN_ALLOW	0x01
 #define FAN_DENY	0x02
+/* errno other than EPERM can specified in upper byte of deny response */
+#define FAN_ERRNO_BITS	8
+#define FAN_ERRNO_SHIFT (32 - FAN_ERRNO_BITS)
+#define FAN_ERRNO_MASK	((1 << FAN_ERRNO_BITS) - 1)
+#define FAN_DENY_ERRNO(err) \
+	(FAN_DENY | ((((__u32)(err)) & FAN_ERRNO_MASK) << FAN_ERRNO_SHIFT))
+
 #define FAN_AUDIT	0x10	/* Bitmask to create audit record for result */
 #define FAN_INFO	0x20	/* Bitmask to indicate additional information */