[v5,18/23] fanotify: Handle FAN_FS_ERROR events

Message ID 20210804160612.3575505-19-krisman@collabora.com (mailing list archive)
State New, archived
Series: File system wide monitoring

Commit Message

Gabriel Krisman Bertazi Aug. 4, 2021, 4:06 p.m. UTC
Wire up FAN_FS_ERROR in the fanotify_mark syscall.  The event can only
be requested for the entire filesystem, thus it requires the
FAN_MARK_FILESYSTEM mark type.

FAN_FS_ERROR has to be handled slightly differently from other events
because it needs to be submitted in an atomic context, using
preallocated memory.  This patch implements the submission path by only
storing the first error event that happened in the slot (userspace
resets the slot by reading the event).

Extra error events happening while the slot is occupied are merged into
the original report; the only information kept for these extra errors is
an accumulator counting the number of events, which is part of the
record reported back to userspace.

Reporting only the first event should be fine, since when a FS error
happens, a cascade of errors usually follows, but the most meaningful
information is (usually) in the first error.

The event dequeueing is also a bit special to avoid losing events. Since
event merging only happens while the event is queued, there is a window
between the time an error event is dequeued (notification_lock is
dropped) and the time it is reset (.free_event()) during which the
slot is full but no merges can happen.

The proposed solution is to replace the event in the slot with a new
structure prior to dropping the lock.  This way, if a new error arrives
between the time the event is dequeued and the time it is reset, it
will still be logged and merged in the new slot.
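
For illustration, a minimal userspace consumer might look like the
sketch below (assuming the FAN_FS_ERROR uapi definitions introduced by
this series are available; the watched path and the error handling are
illustrative only):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/fanotify.h>

	int main(void)
	{
		char buf[4096];
		ssize_t len;
		int fd;

		/* FAN_FS_ERROR reporting requires an FID-mode group. */
		fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, 0);
		if (fd < 0)
			return 1;

		/* The event is only accepted with FAN_MARK_FILESYSTEM. */
		if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
				  FAN_FS_ERROR, AT_FDCWD, "/mnt") < 0)
			return 1;

		/* Reading the event resets the slot, so later errors
		 * can be recorded and merged again. */
		len = read(fd, buf, sizeof(buf));
		if (len > 0)
			printf("read %zd bytes of error event(s)\n", len);

		return 0;
	}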

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>

---
Changes since v4:
  - Split parts to earlier patches (amir)
  - Simplify fanotify entry replacement
  - Update handle size prediction on overflow
Changes since v3:
  - Convert WARN_ON to pr_warn (amir)
  - Remove unnecessary READ/WRITE_ONCE (amir)
  - Alloc with GFP_KERNEL_ACCOUNT (amir)
  - Simplify flags on mark allocation (amir)
  - Avoid atomic set of error_count (amir)
  - Simplify rules when merging error_event (amir)
  - Allocate new error_event on get_one_event (amir)
  - Report superblock error with invalid FH (amir,jan)

Changes since v2:
  - Support and require FID mode (amir)
  - Goto error path instead of early return (amir)
  - Simplify get_one_event (me)
  - Base merging on error_count
  - drop fanotify_queue_error_event

Changes since v1:
  - Pass dentry to fanotify_check_fsid (Amir)
  - FANOTIFY_EVENT_TYPE_ERROR -> FANOTIFY_EVENT_TYPE_FS_ERROR
  - Merge previous patch into it
  - Use a single slot
  - Move fanotify_mark.error_event definition to this commit
  - Rename FAN_ERROR -> FAN_FS_ERROR
  - Restrict FAN_FS_ERROR to FAN_MARK_FILESYSTEM
---
 fs/notify/fanotify/fanotify.c      | 50 +++++++++++++++++++++++-
 fs/notify/fanotify/fanotify.h      |  9 +++++
 fs/notify/fanotify/fanotify_user.c | 63 +++++++++++++++++++++++++++++-
 include/linux/fanotify.h           |  6 ++-
 4 files changed, 124 insertions(+), 4 deletions(-)

Comments

Jan Kara Aug. 5, 2021, 12:15 p.m. UTC | #1
On Wed 04-08-21 12:06:07, Gabriel Krisman Bertazi wrote:
> Wire up FAN_FS_ERROR in the fanotify_mark syscall.  The event can only
> be requested for the entire filesystem, thus it requires the
> FAN_MARK_FILESYSTEM mark type.
> 
> FAN_FS_ERROR has to be handled slightly differently from other events
> because it needs to be submitted in an atomic context, using
> preallocated memory.  This patch implements the submission path by only
> storing the first error event that happened in the slot (userspace
> resets the slot by reading the event).
> 
> Extra error events happening while the slot is occupied are merged into
> the original report; the only information kept for these extra errors is
> an accumulator counting the number of events, which is part of the
> record reported back to userspace.
> 
> Reporting only the first event should be fine, since when a FS error
> happens, a cascade of errors usually follows, but the most meaningful
> information is (usually) in the first error.
> 
> The event dequeueing is also a bit special to avoid losing events. Since
> event merging only happens while the event is queued, there is a window
> between the time an error event is dequeued (notification_lock is
> dropped) and the time it is reset (.free_event()) during which the
> slot is full but no merges can happen.
> 
> The proposed solution is to replace the event in the slot with a new
> structure prior to dropping the lock.  This way, if a new error arrives
> between the time the event is dequeued and the time it is reset, it
> will still be logged and merged in the new slot.
> 
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>

The splitting of the patches really helped. Now I think I can grok much
more detail than before :) Thanks! Some comments below.

> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 0678d35432a7..4e9e271a4394 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -681,6 +681,42 @@ static __kernel_fsid_t fanotify_get_fsid(struct fsnotify_iter_info *iter_info)
>  	return fsid;
>  }
>  
> +static int fanotify_merge_error_event(struct fsnotify_group *group,
> +				      struct fsnotify_event *event)
> +{
> +	struct fanotify_event *fae = FANOTIFY_E(event);
> +	struct fanotify_error_event *fee = FANOTIFY_EE(fae);
> +
> +	/*
> +	 * When err_count > 0, the reporting slot is full.  Just account
> +	 * the additional error and abort the insertion.
> +	 */
> +	if (fee->err_count) {
> +		fee->err_count++;
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static void fanotify_insert_error_event(struct fsnotify_group *group,
> +					struct fsnotify_event *event,
> +					const void *data)
> +{
> +	const struct fs_error_report *report = (struct fs_error_report *) data;
> +	struct fanotify_event *fae = FANOTIFY_E(event);
> +	struct fanotify_error_event *fee;
> +
> +	/* This might be an unexpected type of event (i.e. overflow). */
> +	if (!fanotify_is_error_event(fae->mask))
> +		return;
> +
> +	fee = FANOTIFY_EE(fae);
> +	fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
> +	fee->error = report->error;
> +	fee->err_count = 1;
> +}
> +
>  /*
>   * Add an event to hash table for faster merge.
>   */
> @@ -735,7 +771,7 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
>  	BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
>  	BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);
>  
> -	BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 19);
> +	BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 20);
>  
>  	mask = fanotify_group_event_mask(group, iter_info, mask, data,
>  					 data_type, dir);
> @@ -760,6 +796,18 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
>  			return 0;
>  	}
>  
> +	if (fanotify_is_error_event(mask)) {
> +		struct fanotify_sb_mark *sb_mark =
> +			FANOTIFY_SB_MARK(fsnotify_iter_sb_mark(iter_info));
> +
> +		ret = fsnotify_insert_event(group,
> +					    &sb_mark->fee_slot->fae.fse,
> +					    fanotify_merge_error_event,
> +					    fanotify_insert_error_event,
> +					    data);
> +		goto finish;
> +	}

Hum, seeing this and how you had to extend fsnotify_add_event() to
accommodate this use, cannot we instead have something like:

	if (fanotify_is_error_event(mask)) {
		struct fanotify_sb_mark *sb_mark =
			FANOTIFY_SB_MARK(fsnotify_iter_sb_mark(iter_info));
		struct fanotify_error_event *event = sb_mark->fee_slot;
		bool queue = false;

		spin_lock(&group->notification_lock);
		/* Not yet queued? */
		if (!event->err_count) {
			event->error = report->error;
			queue = true;
		}
		event->err_count++;
		spin_unlock(&group->notification_lock);
		if (queue) {
			... fill in other error info in 'event' such as fhandle
			fsnotify_add_event(group, &event->fae.fse, NULL);
		}
	}

It would be IMHO simpler to follow what's going on and we don't have to
touch fsnotify_add_event(). I do recognize that due to races it may happen
that some racing fsnotify(FAN_FS_ERROR) call returns before the event is
actually visible in the event queue. I don't think it really matters but
if we wanted to be more careful, we would need to preformat the fhandle into a
local buffer and only copy it into the event under notification_lock when
we see the event is unused.
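
Something like the following rough sketch, perhaps (illustrative only;
fanotify_preformat_fh() and the fh_buf field are invented for the
example):

	/*
	 * Hypothetical sketch: preformat the file handle outside the
	 * lock and publish it only if the slot is free.
	 */
	unsigned char fh_buf[MAX_HANDLE_SZ];
	bool queue = false;
	int fh_len;

	fh_len = fanotify_preformat_fh(fh_buf, inode);	/* no lock held */

	spin_lock(&group->notification_lock);
	if (!event->err_count) {
		event->error = report->error;
		memcpy(event->fh_buf, fh_buf, fh_len);	/* hypothetical field */
		queue = true;
	}
	event->err_count++;
	spin_unlock(&group->notification_lock);

	if (queue)
		fsnotify_add_event(group, &event->fae.fse, NULL);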

> +/*
> + * Replace a mark's error event with a new structure in preparation for
> + * it to be dequeued.  This is a bit annoying since we need to drop the
> + * lock, so another thread might just steal the event from us.
> + */
> +static int fanotify_replace_fs_error_event(struct fsnotify_group *group,
> +					   struct fanotify_event *fae)
> +{
> +	struct fanotify_error_event *new, *fee = FANOTIFY_EE(fae);
> +	struct fanotify_sb_mark *sb_mark = fee->sb_mark;
> +	struct fsnotify_event *fse;
> +
> +	pr_debug("%s: event=%p\n", __func__, fae);
> +
> +	assert_spin_locked(&group->notification_lock);
> +
> +	spin_unlock(&group->notification_lock);
> +	new = fanotify_alloc_error_event(sb_mark);
> +	spin_lock(&group->notification_lock);
> +
> +	if (!new)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Since we temporarily dropped the notification_lock, the event
> +	 * might have been taken from under us and reported by another
> +	 * reader.  If that is the case, don't play games, just retry.
> +	 */
> +	fse = fsnotify_peek_first_event(group);
> +	if (fse != &fae->fse) {
> +		kfree(new);
> +		return -EAGAIN;
> +	}
> +
> +	sb_mark->fee_slot = new;
> +
> +	return 0;
> +}
> +
>  /*
>   * Get an fanotify notification event if one exists and is small
>   * enough to fit in "count". Return an error pointer if the count
> @@ -212,9 +252,21 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
>  		goto out;
>  	}
>  
> +	if (fanotify_is_error_event(event->mask)) {
> +		/*
> +		 * Replace the error event ahead of dequeueing so we
> +		 * don't need to handle an incorrectly dequeued event.
> +		 */
> +		ret = fanotify_replace_fs_error_event(group, event);
> +		if (ret) {
> +			event = ERR_PTR(ret);
> +			goto out;
> +		}
> +	}
> +

The replacing, retry, and all is hairy. Cannot we just keep the same event
attached to the sb mark and copy it out to an on-stack buffer under
notification_lock in get_one_event()? The event is big (due to fhandle) but
fanotify_read() is not called from a deep call chain so we should have
enough space on stack for that.
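
For illustration, that could look roughly like the sketch below
(hypothetical; details such as where the buffer lives are hand-waved):

	/*
	 * Hypothetical sketch of copy-to-stack in get_one_event(): the
	 * error event never leaves the mark; the reader copies it out
	 * and reopens the slot while holding notification_lock.  In
	 * practice 'copy' would live in fanotify_read()'s frame so it
	 * stays valid after the lock is dropped.
	 */
	struct fanotify_error_event copy;
	struct fsnotify_event *fsn_event;
	struct fanotify_event *event;

	spin_lock(&group->notification_lock);
	fsn_event = fsnotify_peek_first_event(group);
	...
	event = FANOTIFY_E(fsn_event);
	if (fanotify_is_error_event(event->mask)) {
		memcpy(&copy, FANOTIFY_EE(event), sizeof(copy));
		FANOTIFY_EE(event)->err_count = 0;
		event = &copy.fae;
	}
	fsnotify_remove_first_event(group);
	spin_unlock(&group->notification_lock);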

>  	/*
> -	 * Held the notification_lock the whole time, so this is the
> -	 * same event we peeked above.
> +	 * Even though we might have temporarily dropped the lock, this
> +	 * is guaranteed to be the same event we peeked above.
>  	 */
>  	fsnotify_remove_first_event(group);
>  	if (fanotify_is_perm_event(event->mask))
> @@ -596,6 +648,8 @@ static ssize_t fanotify_read(struct file *file, char __user *buf,
>  		event = get_one_event(group, count);
>  		if (IS_ERR(event)) {
>  			ret = PTR_ERR(event);
> +			if (ret == -EAGAIN)
> +				continue;
>  			break;
>  		}
>  

								Honza
Amir Goldstein Aug. 5, 2021, 1:50 p.m. UTC | #2
On Thu, Aug 5, 2021 at 3:15 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 04-08-21 12:06:07, Gabriel Krisman Bertazi wrote:
> > Wire up FAN_FS_ERROR in the fanotify_mark syscall.  The event can only
> > be requested for the entire filesystem, thus it requires the
> > FAN_MARK_FILESYSTEM mark type.
> >
> > FAN_FS_ERROR has to be handled slightly differently from other events
> > because it needs to be submitted in an atomic context, using
> > preallocated memory.  This patch implements the submission path by only
> > storing the first error event that happened in the slot (userspace
> > resets the slot by reading the event).
> >
> > Extra error events happening while the slot is occupied are merged into
> > the original report; the only information kept for these extra errors is
> > an accumulator counting the number of events, which is part of the
> > record reported back to userspace.
> >
> > Reporting only the first event should be fine, since when a FS error
> > happens, a cascade of errors usually follows, but the most meaningful
> > information is (usually) in the first error.
> >
> > The event dequeueing is also a bit special to avoid losing events. Since
> > event merging only happens while the event is queued, there is a window
> > between the time an error event is dequeued (notification_lock is
> > dropped) and the time it is reset (.free_event()) during which the
> > slot is full but no merges can happen.
> >
> > The proposed solution is to replace the event in the slot with a new
> > structure prior to dropping the lock.  This way, if a new error arrives
> > between the time the event is dequeued and the time it is reset, it
> > will still be logged and merged in the new slot.
> >
> > Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
>
> The splitting of the patches really helped. Now I think I can grok much
> more detail than before :) Thanks! Some comments below.
>
> > diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> > index 0678d35432a7..4e9e271a4394 100644
> > --- a/fs/notify/fanotify/fanotify.c
> > +++ b/fs/notify/fanotify/fanotify.c
> > @@ -681,6 +681,42 @@ static __kernel_fsid_t fanotify_get_fsid(struct fsnotify_iter_info *iter_info)
> >       return fsid;
> >  }
> >
> > +static int fanotify_merge_error_event(struct fsnotify_group *group,
> > +                                   struct fsnotify_event *event)
> > +{
> > +     struct fanotify_event *fae = FANOTIFY_E(event);
> > +     struct fanotify_error_event *fee = FANOTIFY_EE(fae);
> > +
> > +     /*
> > +      * When err_count > 0, the reporting slot is full.  Just account
> > +      * the additional error and abort the insertion.
> > +      */
> > +     if (fee->err_count) {
> > +             fee->err_count++;
> > +             return 1;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +static void fanotify_insert_error_event(struct fsnotify_group *group,
> > +                                     struct fsnotify_event *event,
> > +                                     const void *data)
> > +{
> > +     const struct fs_error_report *report = (struct fs_error_report *) data;
> > +     struct fanotify_event *fae = FANOTIFY_E(event);
> > +     struct fanotify_error_event *fee;
> > +
> > +     /* This might be an unexpected type of event (i.e. overflow). */
> > +     if (!fanotify_is_error_event(fae->mask))
> > +             return;
> > +
> > +     fee = FANOTIFY_EE(fae);
> > +     fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
> > +     fee->error = report->error;
> > +     fee->err_count = 1;
> > +}
> > +
> >  /*
> >   * Add an event to hash table for faster merge.
> >   */
> > @@ -735,7 +771,7 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
> >       BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
> >       BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);
> >
> > -     BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 19);
> > +     BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 20);
> >
> >       mask = fanotify_group_event_mask(group, iter_info, mask, data,
> >                                        data_type, dir);
> > @@ -760,6 +796,18 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
> >                       return 0;
> >       }
> >
> > +     if (fanotify_is_error_event(mask)) {
> > +             struct fanotify_sb_mark *sb_mark =
> > +                     FANOTIFY_SB_MARK(fsnotify_iter_sb_mark(iter_info));
> > +
> > +             ret = fsnotify_insert_event(group,
> > +                                         &sb_mark->fee_slot->fae.fse,
> > +                                         fanotify_merge_error_event,
> > +                                         fanotify_insert_error_event,
> > +                                         data);
> > +             goto finish;
> > +     }
>
> Hum, seeing this and how you had to extend fsnotify_add_event() to
> accommodate this use, cannot we instead have something like:
>
>         if (fanotify_is_error_event(mask)) {
>                 struct fanotify_sb_mark *sb_mark =
>                         FANOTIFY_SB_MARK(fsnotify_iter_sb_mark(iter_info));
>                 struct fanotify_error_event *event = sb_mark->fee_slot;
>                 bool queue = false;
>
>                 spin_lock(&group->notification_lock);
>                 /* Not yet queued? */
>                 if (!event->err_count) {
>                         event->error = report->error;
>                         queue = true;
>                 }
>                 event->err_count++;
>                 spin_unlock(&group->notification_lock);
>                 if (queue) {
>                         ... fill in other error info in 'event' such as fhandle
>                         fsnotify_add_event(group, &event->fae.fse, NULL);
>                 }
>         }
>
> It would be IMHO simpler to follow what's going on and we don't have to
> touch fsnotify_add_event(). I do recognize that due to races it may happen
> that some racing fsnotify(FAN_FS_ERROR) call returns before the event is
> actually visible in the event queue. I don't think it really matters but
> if we wanted to be more careful, we would need to preformat the fhandle into a
> local buffer and only copy it into the event under notification_lock when
> we see the event is unused.
>
> > +/*
> > + * Replace a mark's error event with a new structure in preparation for
> > + * it to be dequeued.  This is a bit annoying since we need to drop the
> > + * lock, so another thread might just steal the event from us.
> > + */
> > +static int fanotify_replace_fs_error_event(struct fsnotify_group *group,
> > +                                        struct fanotify_event *fae)
> > +{
> > +     struct fanotify_error_event *new, *fee = FANOTIFY_EE(fae);
> > +     struct fanotify_sb_mark *sb_mark = fee->sb_mark;
> > +     struct fsnotify_event *fse;
> > +
> > +     pr_debug("%s: event=%p\n", __func__, fae);
> > +
> > +     assert_spin_locked(&group->notification_lock);
> > +
> > +     spin_unlock(&group->notification_lock);
> > +     new = fanotify_alloc_error_event(sb_mark);
> > +     spin_lock(&group->notification_lock);
> > +
> > +     if (!new)
> > +             return -ENOMEM;
> > +
> > +     /*
> > +      * Since we temporarily dropped the notification_lock, the event
> > +      * might have been taken from under us and reported by another
> > +      * reader.  If that is the case, don't play games, just retry.
> > +      */
> > +     fse = fsnotify_peek_first_event(group);
> > +     if (fse != &fae->fse) {
> > +             kfree(new);
> > +             return -EAGAIN;
> > +     }
> > +
> > +     sb_mark->fee_slot = new;
> > +
> > +     return 0;
> > +}
> > +
> >  /*
> >   * Get an fanotify notification event if one exists and is small
> >   * enough to fit in "count". Return an error pointer if the count
> > @@ -212,9 +252,21 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
> >               goto out;
> >       }
> >
> > +     if (fanotify_is_error_event(event->mask)) {
> > +             /*
> > +              * Replace the error event ahead of dequeueing so we
> > +              * don't need to handle an incorrectly dequeued event.
> > +              */
> > +             ret = fanotify_replace_fs_error_event(group, event);
> > +             if (ret) {
> > +                     event = ERR_PTR(ret);
> > +                     goto out;
> > +             }
> > +     }
> > +
>
> The replacing, retry, and all is hairy. Cannot we just keep the same event
> attached to the sb mark and copy it out to an on-stack buffer under
> notification_lock in get_one_event()? The event is big (due to fhandle) but
> fanotify_read() is not called from a deep call chain so we should have
> enough space on stack for that.
>

For the record, this was one of the first implementations from Gabriel.
When I proposed the double buffer implementation, it was either that
or going back to copy to stack.

Given the complications, I agree that going back to copy to stack
is preferred.

Thanks,
Amir.
Gabriel Krisman Bertazi Aug. 10, 2021, 1:35 a.m. UTC | #3
Jan Kara <jack@suse.cz> writes:
>> @@ -760,6 +796,18 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
>>  			return 0;
>>  	}
>>  
>> +	if (fanotify_is_error_event(mask)) {
>> +		struct fanotify_sb_mark *sb_mark =
>> +			FANOTIFY_SB_MARK(fsnotify_iter_sb_mark(iter_info));
>> +
>> +		ret = fsnotify_insert_event(group,
>> +					    &sb_mark->fee_slot->fae.fse,
>> +					    fanotify_merge_error_event,
>> +					    fanotify_insert_error_event,
>> +					    data);
>> +		goto finish;
>> +	}
>
> Hum, seeing this and how you had to extend fsnotify_add_event() to
> accommodate this use, cannot we instead have something like:
>
> 	if (fanotify_is_error_event(mask)) {
> 		struct fanotify_sb_mark *sb_mark =
> 			FANOTIFY_SB_MARK(fsnotify_iter_sb_mark(iter_info));
> 		struct fanotify_error_event *event = sb_mark->fee_slot;
> 		bool queue = false;
>
> 		spin_lock(&group->notification_lock);
> 		/* Not yet queued? */
> 		if (!event->err_count) {
> 			event->error = report->error;
> 			queue = true;
> 		}
> 		event->err_count++;
> 		spin_unlock(&group->notification_lock);
> 		if (queue) {
> 			... fill in other error info in 'event' such as fhandle
> 			fsnotify_add_event(group, &event->fae.fse, NULL);
> 		}
> 	}
>
> It would be IMHO simpler to follow what's going on and we don't have to
> touch fsnotify_add_event(). I do recognize that due to races it may happen
> that some racing fsnotify(FAN_FS_ERROR) call returns before the event is
> actually visible in the event queue. I don't think it really matters but
> if we wanted to be more careful, we would need to preformat the fhandle into a
> local buffer and only copy it into the event under notification_lock when
> we see the event is unused.

Hi Jan,

This is actually similar to my first implementation too (like what
Amir said about the hunk below). It is a shame, cause I really like
the current version better, but the point about not doing the FH
encoding under the notification_lock makes a lot of sense.  I will
revert to the previous approach.

>> +/*
>> + * Replace a mark's error event with a new structure in preparation for
>> + * it to be dequeued.  This is a bit annoying since we need to drop the
>> + * lock, so another thread might just steal the event from us.
>> + */
>> +static int fanotify_replace_fs_error_event(struct fsnotify_group *group,
>> +					   struct fanotify_event *fae)
>> +{
>> +	struct fanotify_error_event *new, *fee = FANOTIFY_EE(fae);
>> +	struct fanotify_sb_mark *sb_mark = fee->sb_mark;
>> +	struct fsnotify_event *fse;
>> +
>> +	pr_debug("%s: event=%p\n", __func__, fae);
>> +
>> +	assert_spin_locked(&group->notification_lock);
>> +
>> +	spin_unlock(&group->notification_lock);
>> +	new = fanotify_alloc_error_event(sb_mark);
>> +	spin_lock(&group->notification_lock);
>> +
>> +	if (!new)
>> +		return -ENOMEM;
>> +
>> +	/*
>> +	 * Since we temporarily dropped the notification_lock, the event
>> +	 * might have been taken from under us and reported by another
>> +	 * reader.  If that is the case, don't play games, just retry.
>> +	 */
>> +	fse = fsnotify_peek_first_event(group);
>> +	if (fse != &fae->fse) {
>> +		kfree(new);
>> +		return -EAGAIN;
>> +	}
>> +
>> +	sb_mark->fee_slot = new;
>> +
>> +	return 0;
>> +}
>> +
>>  /*
>>   * Get an fanotify notification event if one exists and is small
>>   * enough to fit in "count". Return an error pointer if the count
>> @@ -212,9 +252,21 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
>>  		goto out;
>>  	}
>>  
>> +	if (fanotify_is_error_event(event->mask)) {
>> +		/*
>> +		 * Replace the error event ahead of dequeueing so we
>> +		 * don't need to handle an incorrectly dequeued event.
>> +		 */
>> +		ret = fanotify_replace_fs_error_event(group, event);
>> +		if (ret) {
>> +			event = ERR_PTR(ret);
>> +			goto out;
>> +		}
>> +	}
>> +
> The replacing, retry, and all is hairy. Cannot we just keep the same event
> attached to the sb mark and copy it out to an on-stack buffer under
> notification_lock in get_one_event()? The event is big (due to fhandle) but
> fanotify_read() is not called from a deep call chain so we should have
> enough space on stack for that.

Patch

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 0678d35432a7..4e9e271a4394 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -681,6 +681,42 @@  static __kernel_fsid_t fanotify_get_fsid(struct fsnotify_iter_info *iter_info)
 	return fsid;
 }
 
+static int fanotify_merge_error_event(struct fsnotify_group *group,
+				      struct fsnotify_event *event)
+{
+	struct fanotify_event *fae = FANOTIFY_E(event);
+	struct fanotify_error_event *fee = FANOTIFY_EE(fae);
+
+	/*
+	 * When err_count > 0, the reporting slot is full.  Just account
+	 * the additional error and abort the insertion.
+	 */
+	if (fee->err_count) {
+		fee->err_count++;
+		return 1;
+	}
+
+	return 0;
+}
+
+static void fanotify_insert_error_event(struct fsnotify_group *group,
+					struct fsnotify_event *event,
+					const void *data)
+{
+	const struct fs_error_report *report = (struct fs_error_report *) data;
+	struct fanotify_event *fae = FANOTIFY_E(event);
+	struct fanotify_error_event *fee;
+
+	/* This might be an unexpected type of event (i.e. overflow). */
+	if (!fanotify_is_error_event(fae->mask))
+		return;
+
+	fee = FANOTIFY_EE(fae);
+	fee->fae.type = FANOTIFY_EVENT_TYPE_FS_ERROR;
+	fee->error = report->error;
+	fee->err_count = 1;
+}
+
 /*
  * Add an event to hash table for faster merge.
  */
@@ -735,7 +771,7 @@  static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
 	BUILD_BUG_ON(FAN_OPEN_EXEC_PERM != FS_OPEN_EXEC_PERM);
 	BUILD_BUG_ON(FAN_FS_ERROR != FS_ERROR);
 
-	BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 19);
+	BUILD_BUG_ON(HWEIGHT32(ALL_FANOTIFY_EVENT_BITS) != 20);
 
 	mask = fanotify_group_event_mask(group, iter_info, mask, data,
 					 data_type, dir);
@@ -760,6 +796,18 @@  static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
 			return 0;
 	}
 
+	if (fanotify_is_error_event(mask)) {
+		struct fanotify_sb_mark *sb_mark =
+			FANOTIFY_SB_MARK(fsnotify_iter_sb_mark(iter_info));
+
+		ret = fsnotify_insert_event(group,
+					    &sb_mark->fee_slot->fae.fse,
+					    fanotify_merge_error_event,
+					    fanotify_insert_error_event,
+					    data);
+		goto finish;
+	}
+
 	event = fanotify_alloc_event(group, mask, data, data_type, dir,
 				     file_name, &fsid);
 	ret = -ENOMEM;
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 206dc6cfd671..8929ea50f96f 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -223,6 +223,9 @@  FANOTIFY_NE(struct fanotify_event *event)
 
 struct fanotify_error_event {
 	struct fanotify_event fae;
+	s32 error; /* Error reported by the filesystem. */
+	u32 err_count; /* Suppressed errors count */
+
 	struct fanotify_sb_mark *sb_mark; /* Back reference to the mark. */
 };
 
@@ -323,6 +326,11 @@  static inline struct fanotify_event *FANOTIFY_E(struct fsnotify_event *fse)
 	return container_of(fse, struct fanotify_event, fse);
 }
 
+static inline bool fanotify_is_error_event(u32 mask)
+{
+	return mask & FAN_FS_ERROR;
+}
+
 static inline bool fanotify_event_has_path(struct fanotify_event *event)
 {
 	return event->type == FANOTIFY_EVENT_TYPE_PATH ||
@@ -352,6 +360,7 @@  static inline struct path *fanotify_event_path(struct fanotify_event *event)
 static inline bool fanotify_is_hashed_event(u32 mask)
 {
 	return !(fanotify_is_perm_event(mask) ||
+		 fanotify_is_error_event(mask) ||
 		 fsnotify_is_overflow_event(mask));
 }
 
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 76c1c805af3d..e7fe6bc61b6f 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -183,6 +183,45 @@  static struct fanotify_error_event *fanotify_alloc_error_event(
 	return fee;
 }
 
+/*
+ * Replace a mark's error event with a new structure in preparation for
+ * it to be dequeued.  This is a bit annoying since we need to drop the
+ * lock, so another thread might just steal the event from us.
+ */
+static int fanotify_replace_fs_error_event(struct fsnotify_group *group,
+					   struct fanotify_event *fae)
+{
+	struct fanotify_error_event *new, *fee = FANOTIFY_EE(fae);
+	struct fanotify_sb_mark *sb_mark = fee->sb_mark;
+	struct fsnotify_event *fse;
+
+	pr_debug("%s: event=%p\n", __func__, fae);
+
+	assert_spin_locked(&group->notification_lock);
+
+	spin_unlock(&group->notification_lock);
+	new = fanotify_alloc_error_event(sb_mark);
+	spin_lock(&group->notification_lock);
+
+	if (!new)
+		return -ENOMEM;
+
+	/*
+	 * Since we temporarily dropped the notification_lock, the event
+	 * might have been taken from under us and reported by another
+	 * reader.  If that is the case, don't play games, just retry.
+	 */
+	fse = fsnotify_peek_first_event(group);
+	if (fse != &fae->fse) {
+		kfree(new);
+		return -EAGAIN;
+	}
+
+	sb_mark->fee_slot = new;
+
+	return 0;
+}
+
 /*
  * Get an fanotify notification event if one exists and is small
  * enough to fit in "count". Return an error pointer if the count
@@ -196,6 +235,7 @@  static struct fanotify_event *get_one_event(struct fsnotify_group *group,
 	struct fanotify_event *event = NULL;
 	struct fsnotify_event *fsn_event;
 	unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
+	int ret;
 
 	pr_debug("%s: group=%p count=%zd\n", __func__, group, count);
 
@@ -212,9 +252,21 @@  static struct fanotify_event *get_one_event(struct fsnotify_group *group,
 		goto out;
 	}
 
+	if (fanotify_is_error_event(event->mask)) {
+		/*
+		 * Replace the error event ahead of dequeueing so we
+		 * don't need to handle an incorrectly dequeued event.
+		 */
+		ret = fanotify_replace_fs_error_event(group, event);
+		if (ret) {
+			event = ERR_PTR(ret);
+			goto out;
+		}
+	}
+
 	/*
-	 * Held the notification_lock the whole time, so this is the
-	 * same event we peeked above.
+	 * Even though we might have temporarily dropped the lock, this
+	 * is guaranteed to be the same event we peeked above.
 	 */
 	fsnotify_remove_first_event(group);
 	if (fanotify_is_perm_event(event->mask))
@@ -596,6 +648,8 @@  static ssize_t fanotify_read(struct file *file, char __user *buf,
 		event = get_one_event(group, count);
 		if (IS_ERR(event)) {
 			ret = PTR_ERR(event);
+			if (ret == -EAGAIN)
+				continue;
 			break;
 		}
 
@@ -1464,6 +1518,11 @@  static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
 		fsid = &__fsid;
 	}
 
+	if (mask & FAN_FS_ERROR && mark_type != FAN_MARK_FILESYSTEM) {
+		ret = -EINVAL;
+		goto path_put_and_out;
+	}
+
 	/* inode held in place by reference to path; group by fget on fd */
 	if (mark_type == FAN_MARK_INODE)
 		inode = path.dentry->d_inode;
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index c05d45bde8b8..c4d49308b2d0 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -88,9 +88,13 @@  extern struct ctl_table fanotify_table[]; /* for sysctl */
 #define FANOTIFY_INODE_EVENTS	(FANOTIFY_DIRENT_EVENTS | \
 				 FAN_ATTRIB | FAN_MOVE_SELF | FAN_DELETE_SELF)
 
+/* Events that can only be reported with data type FSNOTIFY_EVENT_ERROR */
+#define FANOTIFY_ERROR_EVENTS	(FAN_FS_ERROR)
+
 /* Events that user can request to be notified on */
 #define FANOTIFY_EVENTS		(FANOTIFY_PATH_EVENTS | \
-				 FANOTIFY_INODE_EVENTS)
+				 FANOTIFY_INODE_EVENTS | \
+				 FANOTIFY_ERROR_EVENTS)
 
 /* Events that require a permission response from user */
 #define FANOTIFY_PERM_EVENTS	(FAN_OPEN_PERM | FAN_ACCESS_PERM | \