diff mbox series

[10/18] fsinfo: Provide notification overrun handling support [ver #21]

Message ID 159646187082.1784947.4293611877413578847.stgit@warthog.procyon.org.uk (mailing list archive)
State New, archived
Headers show
Series VFS: Filesystem information [ver #21] | expand

Commit Message

David Howells Aug. 3, 2020, 1:37 p.m. UTC
Provide support for the handling of an overrun in a watch queue.  In the
event that an overrun occurs, the watcher needs to be able to find out what
it was that they missed.  To this end, previous patches added event
counters to struct mount.

To make them accessible, they can be retrieved using fsinfo() and the
FSINFO_ATTR_MOUNT_INFO attribute.

	struct fsinfo_mount_info {
		__u64	mnt_unique_id;
		__u64	mnt_attr_changes;
		__u64	mnt_topology_changes;
		__u64	mnt_subtree_notifications;
	...
	};

There's a uniquifier and some event counters:

 (1) mnt_unique_id - This is an effectively non-repeating ID given to each
     mount object on creation.  This allows the caller to check that the
     mount ID didn't get reused (the 32-bit mount ID is more efficient to
     look up).

 (2) mnt_attr_changes - Count of attribute changes on a mount object.

 (3) mnt_topology_changes - Count of alterations to the mount tree that
     affected this node.

 (4) mnt_subtree_notifications - Count of mount object event notifications
     that were generated in the subtree rooted at this node.  This excludes
     events generated on this node itself and does not include superblock
     events.

The counters are also accessible through the FSINFO_ATTR_MOUNT_CHILDREN
attribute, where a list of all the children of a mount can be scanned.  The
record returned for each child includes the sum of the counters for that
child.  An additional record is added at the end for the queried object and
that also includes the sum of its counters

The mnt_topology_changes counter is also included in
FSINFO_ATTR_MOUNT_TOPOLOGY.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/mount_notify.c           |    2 ++
 fs/namespace.c              |   21 +++++++++++++++++++++
 include/uapi/linux/fsinfo.h |    7 +++++++
 samples/vfs/test-fsinfo.c   |   10 ++++++++--
 4 files changed, 38 insertions(+), 2 deletions(-)

Comments

Miklos Szeredi Aug. 4, 2020, 1:56 p.m. UTC | #1
On Mon, Aug 03, 2020 at 02:37:50PM +0100, David Howells wrote:
> Provide support for the handling of an overrun in a watch queue.  In the
> event that an overrun occurs, the watcher needs to be able to find out what
> it was that they missed.  To this end, previous patches added event
> counters to struct mount.

So this is optimizing the buffer overrun case?

Shoun't we just make sure that the likelyhood of overruns is low and if it
happens, just reinitialize everthing from scratch (shouldn't be *that*
expensive).

Trying to find out what was missed seems like just adding complexity for no good
reason.
Ian Kent Aug. 5, 2020, 2:05 a.m. UTC | #2
On Tue, 2020-08-04 at 15:56 +0200, Miklos Szeredi wrote:
> On Mon, Aug 03, 2020 at 02:37:50PM +0100, David Howells wrote:
> > Provide support for the handling of an overrun in a watch
> > queue.  In the
> > event that an overrun occurs, the watcher needs to be able to find
> > out what
> > it was that they missed.  To this end, previous patches added event
> > counters to struct mount.
> 
> So this is optimizing the buffer overrun case?
> 
> Shoun't we just make sure that the likelyhood of overruns is low and
> if it
> happens, just reinitialize everthing from scratch (shouldn't be
> *that*
> expensive).

But maybe not possible if you are using notifications for tracking
state in user space, you need to know when the thing you have needs
to be synced because you missed something and it's during the
notification processing you actually have the object that may need
to be refreshed.

> 
> Trying to find out what was missed seems like just adding complexity
> for no good
> reason.
>
Ian Kent Aug. 5, 2020, 2:46 a.m. UTC | #3
On Wed, 2020-08-05 at 10:05 +0800, Ian Kent wrote:
> On Tue, 2020-08-04 at 15:56 +0200, Miklos Szeredi wrote:
> > On Mon, Aug 03, 2020 at 02:37:50PM +0100, David Howells wrote:
> > > Provide support for the handling of an overrun in a watch
> > > queue.  In the
> > > event that an overrun occurs, the watcher needs to be able to
> > > find
> > > out what
> > > it was that they missed.  To this end, previous patches added
> > > event
> > > counters to struct mount.
> > 
> > So this is optimizing the buffer overrun case?
> > 
> > Shoun't we just make sure that the likelyhood of overruns is low
> > and
> > if it
> > happens, just reinitialize everthing from scratch (shouldn't be
> > *that*
> > expensive).
> 
> But maybe not possible if you are using notifications for tracking
> state in user space, you need to know when the thing you have needs
> to be synced because you missed something and it's during the
> notification processing you actually have the object that may need
> to be refreshed.
> 
> > Trying to find out what was missed seems like just adding
> > complexity
> > for no good
> > reason.

Coming back to an actual use case.

What I said above is one aspect but, since I'm looking at this right
now with systemd, and I do have the legacy code to fall back to, the
"just reset everything" suggestion does make sense.

But I'm struggling to see how I can identify notification buffer
overrun in libmount, and overrun is just one possibility for lost
notifications, so I like the idea that, as a library user, I can
work out that I need to take action based on what I have in the
notifications themselves.

Ian
Miklos Szeredi Aug. 5, 2020, 7:45 a.m. UTC | #4
On Wed, Aug 5, 2020 at 4:46 AM Ian Kent <raven@themaw.net> wrote:
>

> Coming back to an actual use case.
>
> What I said above is one aspect but, since I'm looking at this right
> now with systemd, and I do have the legacy code to fall back to, the
> "just reset everything" suggestion does make sense.
>
> But I'm struggling to see how I can identify notification buffer
> overrun in libmount, and overrun is just one possibility for lost
> notifications, so I like the idea that, as a library user, I can
> work out that I need to take action based on what I have in the
> notifications themselves.

Hmm, what's the other possibility for lost notifications?

Thanks,
Miklos
Ian Kent Aug. 5, 2020, 11:23 a.m. UTC | #5
On Wed, 2020-08-05 at 09:45 +0200, Miklos Szeredi wrote:
> On Wed, Aug 5, 2020 at 4:46 AM Ian Kent <raven@themaw.net> wrote:
> > Coming back to an actual use case.
> > 
> > What I said above is one aspect but, since I'm looking at this
> > right
> > now with systemd, and I do have the legacy code to fall back to,
> > the
> > "just reset everything" suggestion does make sense.
> > 
> > But I'm struggling to see how I can identify notification buffer
> > overrun in libmount, and overrun is just one possibility for lost
> > notifications, so I like the idea that, as a library user, I can
> > work out that I need to take action based on what I have in the
> > notifications themselves.
> 
> Hmm, what's the other possibility for lost notifications?

In user space that is:

Multi-threaded application races, single threaded applications and
signal processing races, other bugs ...

For example systemd has it's own event handling sub-system and handles
half a dozen or so event types of which the mount changes are one. It's
fairly complex so I find myself wondering if I can trust it and
wondering if there are undiscovered bugs in it. The answer to the
former is probably yes but the answer to the later is also probably
yes.

Maybe I just paranoid!
Ian
Miklos Szeredi Aug. 5, 2020, 11:27 a.m. UTC | #6
On Wed, Aug 5, 2020 at 1:23 PM Ian Kent <raven@themaw.net> wrote:
>
> On Wed, 2020-08-05 at 09:45 +0200, Miklos Szeredi wrote:

> > Hmm, what's the other possibility for lost notifications?
>
> In user space that is:
>
> Multi-threaded application races, single threaded applications and
> signal processing races, other bugs ...

Okay, let's fix the bugs then.

Thanks,
Miklos
David Howells Aug. 5, 2020, 4:06 p.m. UTC | #7
Miklos Szeredi <miklos@szeredi.hu> wrote:

> Shoun't we just make sure that the likelyhood of overruns is low

That's not necessarily easy.  To avoid overruns you need a bigger buffer.  The
buffer is preallocated from unswappable kernel space.  Yes, you can increase
the size of the buffer, but it eats out of your pipe bufferage limit.

Further, it's a *general* notifications queue, not just for a specific
purpose, but that means it might get connected to multiple sources, and doing
something like tearing down a container might generate enough notifications to
overrun the queue.

> and if it happens, just reinitialize everthing from scratch (shouldn't be
> *that* expensive).

If you then spend time reinitialising everything, you're increasing the
likelihood of racing with further events.  Further, there multiple expenses:
firstly, you have to tear down and discard all the data that you've spent time
setting up; secondly, it takes time doing all this; thirdly, it takes cpu
cycles away from applications.

The reason I put the event counters in there and made it so that fsinfo()
could read all the mounts in a subtree and their event counters in one go is
to make it faster for the user to find out what changed in the event that a
notification is lost.

I have a patch (not included here as it occasionally induces oopses) that
attempts to make this doable under the RCU read lock so that it doesn't
prevent mounts from taking place during the scan.

David
Miklos Szeredi Aug. 5, 2020, 5:26 p.m. UTC | #8
On Wed, Aug 5, 2020 at 6:07 PM David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > Shoun't we just make sure that the likelyhood of overruns is low
>
> That's not necessarily easy.  To avoid overruns you need a bigger buffer.  The
> buffer is preallocated from unswappable kernel space.  Yes, you can increase
> the size of the buffer, but it eats out of your pipe bufferage limit.
>
> Further, it's a *general* notifications queue, not just for a specific
> purpose, but that means it might get connected to multiple sources, and doing
> something like tearing down a container might generate enough notifications to
> overrun the queue.
>
> > and if it happens, just reinitialize everthing from scratch (shouldn't be
> > *that* expensive).
>
> If you then spend time reinitialising everything, you're increasing the
> likelihood of racing with further events.  Further, there multiple expenses:
> firstly, you have to tear down and discard all the data that you've spent time
> setting up; secondly, it takes time doing all this; thirdly, it takes cpu
> cycles away from applications.
>
> The reason I put the event counters in there and made it so that fsinfo()
> could read all the mounts in a subtree and their event counters in one go is
> to make it faster for the user to find out what changed in the event that a
> notification is lost.

That's just overdesigning it, IMO.

If the protocol is extensible (as you state) then the counters can be
added as needed.  And unless the above CPU cycle wastage is actually
observed in practice, the whole thing is unnecessary.

Thanks,
Miklos
Ian Kent Aug. 6, 2020, 1:47 a.m. UTC | #9
On Wed, 2020-08-05 at 13:27 +0200, Miklos Szeredi wrote:
> On Wed, Aug 5, 2020 at 1:23 PM Ian Kent <raven@themaw.net> wrote:
> > On Wed, 2020-08-05 at 09:45 +0200, Miklos Szeredi wrote:
> > > Hmm, what's the other possibility for lost notifications?
> > 
> > In user space that is:
> > 
> > Multi-threaded application races, single threaded applications and
> > signal processing races, other bugs ...
> 
> Okay, let's fix the bugs then.

It's the the bugs you don't know about that get you, in this case
the world "is" actually out to get you, ;)

Ian
diff mbox series

Patch

diff --git a/fs/mount_notify.c b/fs/mount_notify.c
index 57eebae51cb1..57995c27ca88 100644
--- a/fs/mount_notify.c
+++ b/fs/mount_notify.c
@@ -93,6 +93,8 @@  void notify_mount(struct mount *trigger,
 	n.watch.info	= info_flags | watch_sizeof(n);
 	n.triggered_on	= trigger->mnt_unique_id;
 
+	smp_wmb(); /* See fsinfo_generic_mount_info(). */
+
 	switch (subtype) {
 	case NOTIFY_MOUNT_EXPIRY:
 	case NOTIFY_MOUNT_READONLY:
diff --git a/fs/namespace.c b/fs/namespace.c
index b5c2a3b4f96d..122c12f9512b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4282,6 +4282,17 @@  int fsinfo_generic_mount_info(struct path *path, struct fsinfo_context *ctx)
 	p->mnt_unique_id	= m->mnt_unique_id;
 	p->mnt_id		= m->mnt_id;
 
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+	p->mnt_subtree_notifications = atomic_long_read(&m->mnt_subtree_notifications);
+	p->mnt_topology_changes	= atomic_long_read(&m->mnt_topology_changes);
+	p->mnt_attr_changes	= atomic_long_read(&m->mnt_attr_changes);
+#endif
+
+	/* Record the counters before reading the attributes as we're not
+	 * holding a lock.  Paired with a write barrier in notify_mount().
+	 */
+	smp_rmb();
+
 	flags = READ_ONCE(m->mnt.mnt_flags);
 	if (flags & MNT_READONLY)
 		p->attr |= MOUNT_ATTR_RDONLY;
@@ -4319,6 +4330,9 @@  int fsinfo_generic_mount_topology(struct path *path, struct fsinfo_context *ctx)
 
 	m = real_mount(path->mnt);
 
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+	p->mnt_topology_changes	= atomic_long_read(&m->mnt_topology_changes);
+#endif
 	p->parent_id = m->mnt_parent->mnt_id;
 
 	if (path->mnt == root.mnt) {
@@ -4445,6 +4459,13 @@  static void fsinfo_store_mount(struct fsinfo_context *ctx, const struct mount *p
 	record.mnt_unique_id	= p->mnt_unique_id;
 	record.mnt_id		= p->mnt_id;
 	record.parent_id	= is_root ? p->mnt_id : p->mnt_parent->mnt_id;
+
+#ifdef CONFIG_MOUNT_NOTIFICATIONS
+	record.mnt_notify_sum	= (atomic_long_read(&p->mnt_attr_changes) +
+				   atomic_long_read(&p->mnt_topology_changes) +
+				   atomic_long_read(&p->mnt_subtree_notifications));
+#endif
+
 	memcpy(ctx->buffer + usage, &record, sizeof(record));
 }
 
diff --git a/include/uapi/linux/fsinfo.h b/include/uapi/linux/fsinfo.h
index f0a352b7028e..b021466dee0f 100644
--- a/include/uapi/linux/fsinfo.h
+++ b/include/uapi/linux/fsinfo.h
@@ -100,6 +100,9 @@  struct fsinfo_mount_info {
 	__u64	mnt_unique_id;		/* Kernel-lifetime unique mount ID */
 	__u32	mnt_id;			/* Mount identifier (use with AT_FSINFO_MOUNTID_PATH) */
 	__u32	attr;			/* MOUNT_ATTR_* flags */
+	__u64	mnt_attr_changes;	/* Number of attribute changes to this mount. */
+	__u64	mnt_topology_changes;	/* Number of topology changes to this mount. */
+	__u64	mnt_subtree_notifications; /* Number of notifications in mount subtree */
 };
 
 #define FSINFO_ATTR_MOUNT_INFO__STRUCT struct fsinfo_mount_info
@@ -113,6 +116,7 @@  struct fsinfo_mount_topology {
 	__u32	dependent_source_id;	/* Dependent: source mount group ID */
 	__u32	dependent_clone_of_id;	/* Dependent: ID of mount this was cloned from */
 	__u32	propagation_type;	/* MOUNT_PROPAGATION_* type */
+	__u64	mnt_topology_changes;	/* Number of topology changes to this mount. */
 };
 
 #define FSINFO_ATTR_MOUNT_TOPOLOGY__STRUCT struct fsinfo_mount_topology
@@ -125,6 +129,9 @@  struct fsinfo_mount_child {
 	__u64	mnt_unique_id;		/* Kernel-lifetime unique mount ID */
 	__u32	mnt_id;			/* Mount identifier (use with AT_FSINFO_MOUNTID_PATH) */
 	__u32	parent_id;		/* Parent mount identifier */
+	__u64	mnt_notify_sum;		/* Sum of mnt_attr_changes, mnt_topology_changes and
+					 * mnt_subtree_notifications.
+					 */
 };
 
 #define FSINFO_ATTR_MOUNT_CHILDREN__STRUCT struct fsinfo_mount_child
diff --git a/samples/vfs/test-fsinfo.c b/samples/vfs/test-fsinfo.c
index b7290ea8eb55..620a02477aa8 100644
--- a/samples/vfs/test-fsinfo.c
+++ b/samples/vfs/test-fsinfo.c
@@ -304,6 +304,10 @@  static void dump_fsinfo_generic_mount_info(void *reply, unsigned int size)
 	printf("\tmnt_uniq: %llx\n", (unsigned long long)r->mnt_unique_id);
 	printf("\tmnt_id  : %x\n", r->mnt_id);
 	printf("\tattr    : %x\n", r->attr);
+	printf("\tevents  : attr=%llu topology=%llu subtree=%llu\n",
+	       (unsigned long long)r->mnt_attr_changes,
+	       (unsigned long long)r->mnt_topology_changes,
+	       (unsigned long long)r->mnt_subtree_notifications);
 }
 
 static void dump_fsinfo_generic_mount_topology(void *reply, unsigned int size)
@@ -332,6 +336,7 @@  static void dump_fsinfo_generic_mount_topology(void *reply, unsigned int size)
 		break;
 	}
 
+	printf("\tevents  : topology=%llu\n", (unsigned long long)r->mnt_topology_changes);
 }
 
 static void dump_fsinfo_generic_mount_children(void *reply, unsigned int size)
@@ -354,8 +359,9 @@  static void dump_fsinfo_generic_mount_children(void *reply, unsigned int size)
 		mp = "<this>";
 	}
 
-	printf("%8x %16llx %s\n",
-	       r->mnt_id, (unsigned long long)r->mnt_unique_id, mp);
+	printf("%8x %16llx %10llu %s\n",
+	       r->mnt_id, (unsigned long long)r->mnt_unique_id,
+	       (unsigned long long)r->mnt_notify_sum, mp);
 }
 
 static void dump_string(void *reply, unsigned int size)